update joss paper

maxibor · May 3, 2019 · 83a45da
1 parent 8470241
commit 83a45da

Showing 2 changed files with 134 additions and 10 deletions.
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -6,5 +6,80 @@ @article{kraken
   number={3},
   pages={R46},
   year={2014},
-  publisher={BioMed Central}
+  publisher={BioMed Central},
+  doi={10.1186/gb-2014-15-3-r46}
 }
+@article{gmpr,
+  title={GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data},
+  author={Chen, Li and Reeve, James and Zhang, Lujun and Huang, Shengbing and Wang, Xuefeng and Chen, Jun},
+  journal={PeerJ},
+  volume={6},
+  pages={e4600},
+  year={2018},
+  publisher={PeerJ Inc.},
+  doi={10.7717/peerj.4600}
+}
+@article{metagenomics,
+  title={Microbiology: metagenomics},
+  author={Hugenholtz, Philip and Tyson, Gene W},
+  journal={Nature},
+  volume={455},
+  number={7212},
+  pages={481},
+  year={2008},
+  publisher={Nature Publishing Group},
+  doi={10.1038/455481a}
+}
+@article{scikit-learn,
+  title={Scikit-learn: Machine learning in Python},
+  author={Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others},
+  journal={Journal of machine learning research},
+  volume={12},
+  number={Oct},
+  pages={2825--2830},
+  year={2011},
+}
+@article{platt,
+  title={Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods},
+  author={Platt, John and others},
+  journal={Advances in large margin classifiers},
+  volume={10},
+  number={3},
+  pages={61--74},
+  year={1999},
+  publisher={Cambridge, MA}
+}
+@article{ete3,
+  title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data},
+  author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer},
+  journal={Molecular biology and evolution},
+  volume={33},
+  number={6},
+  pages={1635--1638},
+  year={2016},
+  publisher={Society for Molecular Biology and Evolution},
+  doi={10.1093/molbev/msw046}
+}
+@article{wu,
+  title={Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities},
+  author={Lozupone, Catherine A and Hamady, Micah and Kelley, Scott T and Knight, Rob},
+  journal={Appl. Environ. Microbiol.},
+  volume={73},
+  number={5},
+  pages={1576--1585},
+  year={2007},
+  publisher={Am Soc Microbiol},
+  doi={10.1128/AEM.01996-06}
+}
+@article{tsne,
+  title={Visualizing data using t-SNE},
+  author={Maaten, Laurens van der and Hinton, Geoffrey},
+  journal={Journal of machine learning research},
+  volume={9},
+  number={Nov},
+  pages={2579--2605},
+  year={2008}
+}
+
+
+
diff --git a/paper/paper.md b/paper/paper.md
@@ -9,21 +9,70 @@ authors:
    orcid: 0000-0001-9140-7559
    affiliation: "1"
 affiliations:
- - name: Department of Archaegenetics, Max Planck Institute for the Science of Human Histoy, Jena, 07745, Germany
+ - name: Department of Archaeogenetics, Max Planck Institute for the Science of Human History, Jena, 07745, Germany
    index: 1
-date: 2 May 2019
+date: 3rd May 2019
 bibliography: paper.bib
 ---
 
 # Summary
 
-SourcePredict is a Python package to classify and predict the source of metagenomics sample given a training set.
-The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [ADD_REF].
-One of the goal of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
-These taxonomic classifiers, such as Kraken [@kraken] for example, will compute the organism composition from sequencing data.
+SourcePredict is a Python package to classify and predict the source of metagenomic samples given a training set.  
 
-When in most cases the origin of a metagenomics sample is known, it is sometimes part of the research question to infer and/or confirm its origin.
-For samples of known origin, a training set can be established with the sample composition as data, and the origin of the sample as labels.
-Using this training set, a machine learning algorithm can the predict the origin of unlabeled samples from their composition.
+The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [@metagenomics].  
+One of the goals of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
+These taxonomic classifiers, such as Kraken [@kraken], compute the taxonomic composition of a sample, in Operational Taxonomic Units (OTUs), from the DNA sequencing data.
+
+While in most cases the origin of a metagenomic sample is known, it is sometimes part of the research question to infer and/or confirm its source.  
+Using samples of known sources, a training set can be established with the OTU sample composition as features, and the source of the sample as class labels.  
+With this training set, a machine learning algorithm can be trained to predict the source of unlabeled samples from their OTU taxonomic composition.
+
+Here, I developed SourcePredict to classify and predict the sources of unlabeled samples from their OTU taxonomic compositions.
+
+## Method
+
+All samples are first normalized to correct for uneven sequencing depth using GMPR (default) [@gmpr].
+After normalization, SourcePredict performs a two-step prediction.
+
+### Prediction of unknown sources proportion
+
+The unknown sources proportion is the proportion of OTUs in the test sample which are not present in the training dataset.  
+
+Let $S$ be a sample of size $O$, i.e. with $O$ OTUs, from the test dataset $D_{test}$.  
+Let $n$ be the average number of samples per class in the training dataset.  
+Let $U_n$ be the set of $n$ samples to add to the training dataset to account for the unknown source proportion in a test sample.  
+
+First, a proportion $\alpha$ (default = $0.1$) of each OTU $o_i$ (with $i\in[1,O]$) is added to the training dataset for each sample $U_j$ (with $j\in[1,n]$), such that $U_j(o_i) = \alpha\times S(o_i)$.  
+
+The $U_n$ samples are then merged as columns to the training dataset ($D_{train}$) to create a new training dataset denoted $D_{train\ unknown}$.
+
+To predict this unknown proportion, the dimension of the training dataset $D_{train\ unknown}$ (samples in columns, OTUs as rows) is first reduced to 20 with the scikit-learn [@scikit-learn] implementation of PCA.  
+This training dataset is further divided into three subsets: train (64%), test (20%), and validation (16%).  
+The scikit-learn implementation of the K-Nearest-Neighbors (KNN) algorithm is then trained on the train subset, and the test accuracy is computed with the test subset.  
+The trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with the validation subset.
+This procedure is repeated for each sample of the test dataset.
+
+### Prediction of known source proportion
+
+First, only OTUs corresponding to the *species* taxonomic level are kept using the ETE toolkit [@ete3].
+A distance matrix is then computed on the merged training dataset $D_{train}$ and test dataset $D_{test}$ using the scikit-bio implementation of the weighted UniFrac distance (default) [@wu].
+
+The distance matrix is then embedded in two dimensions using the scikit-learn implementation of t-SNE [@tsne].
+
+The 2-dimensional embedding is then split back into training and testing datasets.
+
+The training dataset is further divided into three subsets: train (64%), test (20%), and validation (16%).  
+The scikit-learn implementation of the K-Nearest-Neighbors (KNN) algorithm is then trained on the train subset, and the test accuracy is computed with the test subset.  
+The trained KNN model is then corrected for source proportion estimation using the scikit-learn implementation of Platt's scaling method with the validation subset.
+
+### Combining unknown and source proportion
+
+For each sample, the predicted unknown proportion $p_{unknown}$ is then combined with the predicted proportion of each of the $C$ source classes $c$ of the training dataset such that:
+
+$$\sum_{c=1}^{C} s_c + p_{unknown} = 1$$
+
+with  
+
+$$s_c = s_{c\ predicted}\times (1 - p_{unknown})$$
 
 # References
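
The arithmetic in the Method section added to `paper/paper.md` can be illustrated with a minimal pure-Python sketch. This is not SourcePredict's implementation: the function names and the dictionary-based data layout are hypothetical. It also rests on two assumptions. First, the 64% / 20% / 16% subsets are assumed to come from two nested 80/20 splits. Second, the class proportions are assumed to be rescaled by the known fraction $1 - p_{unknown}$, which is what the constraint $\sum_{c} s_c + p_{unknown} = 1$ requires when the predicted class proportions sum to one.

```python
def make_unknown_samples(sample, n, alpha=0.1):
    """Build n synthetic 'unknown' samples U_j with U_j(o_i) = alpha * S(o_i).

    `sample` maps OTU identifiers to counts (hypothetical layout).
    """
    return [
        {otu: alpha * count for otu, count in sample.items()}
        for _ in range(n)
    ]


def split_sizes(n_samples, test_frac=0.2, validation_frac=0.2):
    """Subset sizes for two nested splits (assumed, not taken from the source).

    The test subset is taken from the full set, then the validation subset
    is taken from what remains: 0.2 of the remaining 80% is 16% overall,
    which leaves 64% for training.
    """
    n_test = round(n_samples * test_frac)
    n_validation = round((n_samples - n_test) * validation_frac)
    n_train = n_samples - n_test - n_validation
    return n_train, n_test, n_validation


def combine_proportions(predicted, p_unknown):
    """Scale class proportions by (1 - p_unknown) and append the unknown share.

    `predicted` maps each source class to its predicted proportion and is
    expected to sum to 1; `p_unknown` must lie in [0, 1].
    """
    if not 0.0 <= p_unknown <= 1.0:
        raise ValueError("p_unknown must be between 0 and 1")
    if abs(sum(predicted.values()) - 1.0) > 1e-9:
        raise ValueError("predicted class proportions must sum to 1")
    combined = {c: p * (1.0 - p_unknown) for c, p in predicted.items()}
    combined["unknown"] = p_unknown
    return combined
```

For instance, `split_sizes(100)` gives 64 training, 20 test and 16 validation samples, and `combine_proportions({"soil": 0.75, "gut": 0.25}, 0.2)` assigns 0.6 to soil, 0.2 to gut and 0.2 to the unknown source, up to floating-point rounding.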