paper update

maxibor · Aug 30, 2019 · 068358e
1 parent 89b5cdf

Showing 3 changed files with 57 additions and 51 deletions.
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -1,7 +1,7 @@
 @article{kraken,
   title={Kraken: ultrafast metagenomic sequence classification using exact alignments},
   author={Wood, Derrick E and Salzberg, Steven L},
-  journal={Genome biology},
+  journal={Genome Biology},
   volume={15},
   number={3},
   pages={R46},
@@ -31,9 +31,9 @@ @article{metagenomics
   doi={10.1038/455481a}
 }
 @article{scikit-learn,
-  title={Scikit-learn: Machine learning in Python},
+  title={{Scikit-learn: Machine learning in Python}},
   author={Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others},
-  journal={Journal of machine learning research},
+  journal={Journal of Machine Learning Research},
   volume={12},
   number={Oct},
   pages={2825--2830},
@@ -42,7 +42,7 @@ @article{scikit-learn
 @article{platt,
   title={Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods},
   author={Platt, John and others},
-  journal={Advances in large margin classifiers},
+  journal={Advances in Large Margin Classifiers},
   volume={10},
   number={3},
   pages={61--74},
@@ -52,7 +52,7 @@ @article{platt
 @article{ete3,
   title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data},
   author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer},
-  journal={Molecular biology and evolution},
+  journal={Molecular Biology and Evolution},
   volume={33},
   number={6},
   pages={1635--1638},
@@ -72,9 +72,9 @@ @article{wu
   doi={10.1128/AEM.01996-06}
 }
 @article{tsne,
-  title={Visualizing data using t-SNE},
+  title={Visualizing data using {t-SNE}},
   author={Maaten, Laurens van der and Hinton, Geoffrey},
-  journal={Journal of machine learning research},
+  journal={Journal of Machine Learning Research},
   volume={9},
   number={Nov},
   pages={2579--2605},
@@ -92,7 +92,7 @@ @article{sourcetracker
   doi={10.1038/nmeth.1650}
 }
 @article{bray-curtis,
-  title={An ordination of the upland forest communities of southern Wisconsin},
+  title={An ordination of the upland forest communities of {southern Wisconsin}},
   author={Bray, J Roger and Curtis, John T},
   journal={Ecological monographs},
   volume={27},
@@ -102,6 +102,44 @@ @article{bray-curtis
   publisher={Wiley Online Library},
   doi={10.2307/1942268}
 }
+@misc{scikit-bio,
+  author       = {Jai Ram Rideout and
+                  Greg Caporaso and
+                  Evan Bolyen and
+                  Daniel McDonald and
+                  Yoshiki Vázquez Baeza and
+                  Jorge Cañardo Alastuey and
+                  Anders Pitman and
+                  Jamie Morton and
+                  Jose Navas and
+                  Kestrel Gorlick and
+                  Justine Debelius and
+                  Zech Xu and
+                  llcooljohn and
+                  adamrp and
+                  Joshua Shorenstein and
+                  Laurent Luce and
+                  Will Van Treuren and
+                  John Chase and
+                  charudatta-navare and
+                  Colin Brislawn and
+                  Antonio Gonzalez and
+                  Weronika Patena and
+                  Karen Schwarzberg and
+                  teravest and
+                  Jens Reeder and
+                  shiffer1 and
+                  nbresnick and
+                  Kevin Murray and
+                  alexbrc and
+                  Karan Sharma},
+  title        = {{biocore/scikit-bio: scikit-bio 0.5.5: More 
+                   compositional methods added}},
+  month        = dec,
+  year         = 2018,
+  doi          = {10.5281/zenodo.2254379},
+  url          = {https://doi.org/10.5281/zenodo.2254379}
+}
 
 
 

diff --git a/paper/paper.md b/paper/paper.md
@@ -21,12 +21,12 @@ bibliography: paper.bib
 
 # Summary
 
-SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python package distributed through Conda, to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.
+SourcePredict is a Python package distributed through Conda, to classify and predict the origin of metagenomic samples, given a reference dataset of known origins, a problem also known as source tracking.
 
 DNA shotgun sequencing of human, animal, and environmental samples has opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [@metagenomics]. One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken [@kraken].
 
 In cases where the origin of a metagenomic sample, its source, is unknown, it is often part of the research question to predict and/or confirm the source. For example, in microbial archaelogy, it is sometimes necessary to rely on metagenomics to validate the source of paleofaeces.
-Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, *i.e.* the organisms identified in the samples as features, and the sources of the samples as class labels.
+Using samples of known sources, a reference dataset can be established with the taxonomic composition of the samples, i.e., the organisms identified in the samples as features, and the sources of the samples as class labels.
 
 With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition.
 
@@ -38,60 +38,28 @@ However, the Sourcepredict results are more easily interpreted since the samples
 
 Starting with a numerical organism count matrix (samples as columns, organisms as rows, obtained by a taxonomic classifier) of merged references and sinks datasets, samples are first normalized relative to each other, to correct for uneven sequencing depth using the geometric mean of pairwise ratios (GMPR) method (default) [@gmpr].
 
-After normalization, Sourcepredict performs a two-step prediction algorithm. First, it predicts the proportion of unknown sources, *i.e.* which are not represented in the reference dataset. Second it predicts the proportion of each known source of the reference dataset in the sink samples.
+After normalization, Sourcepredict performs a two-step prediction algorithm. First, it predicts the proportion of unknown sources, i.e., which are not represented in the reference dataset. Second, it predicts the proportion of each known source of the reference dataset in the sink samples.
 
 Organisms are represented by their taxonomic identifiers (TAXID).
 
-### Prediction of unknown sources proportion
+### Prediction of the proportion of unknown sources
 
-Let $S_i \in \{S_1, .., S_n\}$ be a sample from the normalized sinks dataset $D_{sink}$, $o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}$ be an organism in $S_i$, and $n_o^{\ i}$ be the total number of organisms in $S_i$, with $o_{j}^{\ i} \in \mathbb{Z}+$.
+Let $S_i \in \{S_1, .., S_n\}$ be a sample from the normalized sinks dataset $D_{sink}$, $o_{j}^{\ i} \in \{o_{1}^{\ i},.., o_{n_o^{\ i}}^{\ i}\}$  an organism in $S_i$, and $n_o^{\ i}$  the total number of organisms in $S_i$, with $o_{j}^{\ i} \in \mathbb{Z}+$. Let $m$ be the mean number of samples per source in the reference dataset, such that $m = \frac{1}{O}\sum_{i=1}^{O}S_i$. For each $S_i$ sample, I define $||m||$ derivative samples $U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}$ to add to the reference dataset to account for the unknown source proportion in a test sample. Separately for each $S_i$, a proportion denoted $\alpha \in [0,1]$ (default = $0.1$) of each $o_{j}^{\ i}$ organism of $S_i$ is added to each $U_k^{S_i}$ sample such that $U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}$ , where $x_{i \ j}$ is sampled from a Gaussian distribution $\mathcal{N}\big(S_i(o_j^{\ i}), 0.01)$. The $||m||$ $U_k^{S_i}$ samples are then added to the reference dataset $D_{ref}$, and labeled as *unknown*, to create a new reference dataset denoted ${}^{unk}D_{ref}$. To predict the proportion of unknown sources, a Bray-Curtis [@bray-curtis] pairwise dissimilarity matrix of all $S_i$ and $U_k^{S_i}$ samples is computed using scikit-bio [@scikit-bio]. This distance matrix is then embedded in two dimensions (default) with the scikit-bio implementation of PCoA. This sample embedding is divided into three subsets: ${}^{unk}D_{train}$ ($64\%$), ${}^{unk}D_{test}$ ($20\%$), and ${}^{unk}D_{validation}$($16\%$). The scikit-learn [@scikit-learn] implementation of KNN algorithm is then trained on ${}^{unk}D_{train}$, and the training accuracy is computed with ${}^{unk}D_{test}$. This trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with ${}^{unk}D_{validation}$. 
+The proportion of unknown sources in $S_i$, $p_u \in [0,1]$ is then estimated using this trained and corrected KNN model. Ultimately, this process is repeated independently for each sink sample $S_i$ of $D_{sink}$.
 
-Let $m$ be the mean number of samples per source in the reference dataset, such that $m = \frac{1}{O}\sum_{i=1}^{O}S_i$.
+### Prediction of the proportion of known sources
 
-For each $S_i$ sample, I define $||m||$ derivative samples $U_k^{S_i} \in \{U_1^{S_i}, ..,U_{||m||}^{S_i}\}$ to add to the reference dataset to account for the unknown source proportion in a test sample.
+First, only organism TAXIDs corresponding to the species taxonomic level are retained using the ETE toolkit [@ete3]. A weighted Unifrac (default) [@wu] pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree. This distance matrix is then embedded in two dimensions (default) using the scikit-learn implementation of t-SNE [@tsne]. The 2-dimensional embedding is then split back to training ${}^{tsne}D_{ref}$ and testing dataset ${}^{tsne}D_{sink}$. The KNN algorithm is then trained on the train subset, with a five (default) cross validation to look for the optimum number of K-neighbors.
+The training dataset ${}^{tsne}D_{ref}$ is further divided into three subsets: ${}^{tsne}D_{train}$ ($64\%$), ${}^{tsne}D_{test}$ ($20\%$), and ${}^{tsne}D_{validation}$ ($16\%$). The training accuracy is then computed with ${}^{tsne}D_{test}$. Finally, this second trained KNN model is also corrected for source proportion estimation using the scikit-learn implementation of the Platt's method with ${}^{tsne}D_{validation}$. The proportion $p_{c_s} \in [0,1]$ of each of the $n_s$ sources $c_s \in \{c_{1},\ ..,\ c_{n_s}\}$ in each sample $S_i$ is then estimated using this second trained and corrected KNN model.
 
-Separately for each $S_i$, a proportion denoted $\alpha \in [0,1]$ (default = $0.1$) of each of the $o_{j}^{\ i}$ organism of $S_i$ is added to each $U_k^{S_i}$ samples such that $U_k^{S_i}(o_j^{\ i}) = \alpha \cdot x_{i \ j}$ , where $x_{i \ j}$ is sampled from a Gaussian distribution $\mathcal{N}\big(S_i(o_j^{\ i}), 0.01)$.
+### Combining unknown and source proportions
 
-The $||m||$ $U_k^{S_i}$ samples are then added to the reference dataset $D_{ref}$, and labeled as *unknown*, to create a new reference dataset denoted ${}^{unk}D_{ref}$.
-
-To predict the proportion of unknown sources, a Bray-Curtis [@bray-curtis] pairwise dissimilarity matrix of all $S_i$ and $U_k^{S_i}$ samples is computed using scikit-bio. This distance matrix is then embedded in two dimensions (default) with the scikit-bio implementation of PCoA.
-
-This sample embedding is divided into three subsets: ${}^{unk}D_{train}$ ($64\%$), ${}^{unk}D_{test}$ ($20\%$), and ${}^{unk}D_{validation}$($16\%$).
-
-The scikit-learn implementation of KNN algorithm is then trained on ${}^{unk}D_{train}$, and the training accuracy is computed with ${}^{unk}D_{test}$.
-
-This trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with ${}^{unk}D_{validation}$.
-
-The proportion of unknown sources in $S_i$, $p_u \in [0,1]$ is then estimated using this trained and corrected KNN model.
-
-Ultimately, this process is repeated independantly for each sink sample $S_i$ of $D_{sink}$.
-
-### Prediction of known source proportion
-
-First, only organism TAXIDs corresponding to the species taxonomic level are retained using the ETE toolkit [@ete3].
-A weighted Unifrac (default) [@wu] pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree.
-
-This distance matrix is then embedded in two dimensions (default) using the scikit-learn implementation of t-SNE [@tsne].
-
-The 2-dimensional embedding is then split back to training ${}^{tsne}D_{ref}$ and testing dataset ${}^{tsne}D_{sink}$.
-
-The KNN algorithm is then trained on the train subset, with a five (default) cross validation to look for the optimum number of K-neighbors.
-The training dataset ${}^{tsne}D_{ref}$ is further divided into three subsets: ${}^{tsne}D_{train}$ ($64\%$), ${}^{tsne}D_{test}$ ($20\%$), and ${}^{tsne}D_{validation}$ ($16\%$).
-
-The training accuracy is then computed with ${}^{tsne}D_{test}$.
-Finally, this second trained KNN model is also corrected for source proportion estimation using the scikit-learn implementation of the Platt's method with ${}^{tsne}D_{validation}$.
-
-The proportion $p_{c_s} \in [0,1]$ of each of the $n_s$ sources $c_s \in \{c_{1},\ ..,\ c_{n_s}\}$ in each sample $S_i$ is then estimated using this second trained and corrected KNN model.
-
-### Combining unknown and source proportion
-
-Then for each sample $S_i$ of the test dataset $D_{sink}$, the predicted unknown proportion $p_{u}$ is then combined with the predicted proportion $p_{c_s}$ for each of the $n_s$ sources $c_s$ of the training dataset such that $\sum_{c_s=1}^{n_s} s_c + p_u = 1$ where $s_c = p_{c_s} \cdot p_u$.
+For each sample $S_i$ of the test dataset $D_{sink}$, the predicted unknown proportion $p_{u}$ is then combined with the predicted proportion $p_{c_s}$ for each of the $n_s$ sources $c_s$ of the training dataset such that $\sum_{c_s=1}^{n_s} s_c + p_u = 1$ where $s_c = p_{c_s} \cdot p_u$.
 
 Finally, a summary table gathering the estimated sources proportions is returned as a `csv` file, as well as the t-SNE embedding sample coordinates.
 
 ## Acknowledgements
 
-Thanks to Dr. Christina Warinner, Dr. Alexander Herbig, Dr. AB Rohrlach, and Alexander Hübner for their valuable comments and for proofreading this manuscript.
+Thanks to Dr.\ Christina Warinner, Dr.\ Alexander Herbig, Dr.\ AB Rohrlach, and Alexander Hübner for their valuable comments and for proofreading this manuscript.
 This work was funded by the Max Planck Society and the Deutsche Forschungsgemeinschaft, project code: EXC 2051 #390713860.
 
 # References
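
Two steps described in the revised `paper.md` text can be sketched in plain Python: building the derivative *unknown* samples $U_k^{S_i}$ from a sink sample, and combining $p_u$ with the known-source proportions $p_{c_s}$. This is a minimal illustration under stated assumptions, not Sourcepredict's own code. The function names, the example TAXIDs and the example proportions are hypothetical. The value $0.01$ is treated as a standard deviation, which the text does not specify. The known-source proportions are assumed to sum to 1 and are scaled by $1 - p_u$, the scaling under which the stated constraint $\sum s_c + p_u = 1$ holds; the formula as written in the paper scales by $p_u$ instead.

```python
import random


def make_unknown_samples(sink_counts, n_samples, alpha=0.1, sd=0.01, seed=None):
    """Build n_samples derivative 'unknown' samples from one sink sample S_i.

    sink_counts maps TAXID -> normalized count. Each derived value is
    alpha * x, where x is drawn from a Gaussian centred on the sink's count.
    Assumption: 0.01 is used as the standard deviation of that Gaussian.
    """
    rng = random.Random(seed)
    return [
        {taxid: alpha * rng.gauss(count, sd) for taxid, count in sink_counts.items()}
        for _ in range(n_samples)
    ]


def combine_proportions(p_unknown, p_known):
    """Merge the unknown proportion with the known-source proportions.

    Assumption: p_known sums to 1 and every source is scaled by
    (1 - p_unknown), so that all returned proportions sum to 1.
    """
    remaining = 1.0 - p_unknown
    combined = {source: p * remaining for source, p in p_known.items()}
    combined["unknown"] = p_unknown
    return combined


# Illustrative inputs only: three TAXIDs with made-up normalized counts.
sink = {"9606": 120.0, "562": 30.0, "1280": 5.0}
unknowns = make_unknown_samples(sink, n_samples=3, seed=0)

# Made-up model outputs for one sink sample.
summary = combine_proportions(0.2, {"human": 0.7, "dog": 0.25, "soil": 0.05})
```

Here `n_samples` plays the role of $||m||$ in the text, and `summary` holds one row of the kind of per-sample proportion table the paper describes.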
diff --git a/paper/paper.pdf b/paper/paper.pdf