update joss paper
maxibor committed May 3, 2019
1 parent 8470241 commit 83a45da
Showing 2 changed files with 134 additions and 10 deletions.
77 changes: 76 additions & 1 deletion paper/paper.bib
@@ -6,5 +6,80 @@ @article{kraken
number={3},
pages={R46},
year={2014},
publisher={BioMed Central}
publisher={BioMed Central},
doi={10.1186/gb-2014-15-3-r46}
}
@article{gmpr,
title={GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data},
author={Chen, Li and Reeve, James and Zhang, Lujun and Huang, Shengbing and Wang, Xuefeng and Chen, Jun},
journal={PeerJ},
volume={6},
pages={e4600},
year={2018},
publisher={PeerJ Inc.},
doi={10.7717/peerj.4600}
}
@article{metagenomics,
title={Microbiology: metagenomics},
author={Hugenholtz, Philip and Tyson, Gene W},
journal={Nature},
volume={455},
number={7212},
pages={481},
year={2008},
publisher={Nature Publishing Group},
doi={10.1038/455481a}
}
@article{scikit-learn,
title={Scikit-learn: Machine learning in Python},
author={Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others},
journal={Journal of machine learning research},
volume={12},
number={Oct},
pages={2825--2830},
year={2011},
}
@article{platt,
title={Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods},
author={Platt, John and others},
journal={Advances in large margin classifiers},
volume={10},
number={3},
pages={61--74},
year={1999},
publisher={Cambridge, MA}
}
@article{ete3,
title={ETE 3: reconstruction, analysis, and visualization of phylogenomic data},
author={Huerta-Cepas, Jaime and Serra, Fran{\c{c}}ois and Bork, Peer},
journal={Molecular biology and evolution},
volume={33},
number={6},
pages={1635--1638},
year={2016},
publisher={Society for Molecular Biology and Evolution},
doi={10.1093/molbev/msw046}
}
@article{wu,
title={Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities},
author={Lozupone, Catherine A and Hamady, Micah and Kelley, Scott T and Knight, Rob},
journal={Appl. Environ. Microbiol.},
volume={73},
number={5},
pages={1576--1585},
year={2007},
publisher={Am Soc Microbiol},
doi={10.1128/AEM.01996-06}
}
@article{tsne,
title={Visualizing data using t-SNE},
author={Maaten, Laurens van der and Hinton, Geoffrey},
journal={Journal of machine learning research},
volume={9},
number={Nov},
pages={2579--2605},
year={2008}
}



67 changes: 58 additions & 9 deletions paper/paper.md
@@ -9,21 +9,70 @@ authors:
orcid: 0000-0001-9140-7559
affiliation: "1"
affiliations:
- name: Department of Archaegenetics, Max Planck Institute for the Science of Human Histoy, Jena, 07745, Germany
- name: Department of Archaeogenetics, Max Planck Institute for the Science of Human History, Jena, 07745, Germany
index: 1
date: 2 May 2019
date: 3rd May 2019
bibliography: paper.bib
---

# Summary

SourcePredict is a Python package to classify and predict the source of metagenomics sample given a training set.
The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [ADD_REF].
One of the goal of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
These taxonomic classifiers, such as Kraken [@kraken] for example, will compute the organism composition from sequencing data.
SourcePredict is a Python package to classify and predict the source of metagenomic samples given a training set.

When in most cases the origin of a metagenomics sample is known, it is sometimes part of the research question to infer and/or confirm its origin.
For samples of known origin, a training set can be established with the sample composition as data, and the origin of the sample as labels.
Using this training set, a machine learning algorithm can the predict the origin of unlabeled samples from their composition.
The DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [@metagenomics].
One of the goals of metagenomics is to look at the composition of a sequencing sample with tools known as taxonomic classifiers.
These taxonomic classifiers, such as Kraken [@kraken], compute the taxonomic composition of a sample, expressed in Operational Taxonomic Units (OTUs), from the DNA sequencing data.

While the origin of a metagenomic sample is known in most cases, inferring and/or confirming its source is sometimes part of the research question.
Using samples of known sources, a training set can be established with the OTU sample composition as features, and the source of the sample as class labels.
With this training set, a machine learning algorithm can be trained to predict the source of unlabeled samples from their OTU taxonomic composition.

Here, I developed SourcePredict to classify and predict the sources of unlabeled samples from their OTU taxonomic compositions.

## Method

All samples are first normalized to correct for uneven sequencing depth using GMPR (default) [@gmpr].
After normalization, SourcePredict performs a two-step prediction.
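
As an illustration of the normalization step, here is a minimal NumPy sketch of GMPR-style size factors. It is a simplified reading of the method in [@gmpr], not the code used by the package: the function names `gmpr_size_factors` and `gmpr_normalize` are hypothetical, samples are assumed to be in rows, self-comparisons are skipped, and the minimum number of shared OTUs enforced by the reference implementation is omitted.

```python
import numpy as np


def gmpr_size_factors(counts):
    """Simplified GMPR size factors (hypothetical helper).

    counts: 2D array-like, rows = samples, columns = OTUs.
    For each sample i, take the median count ratio against every other
    sample j over the OTUs that are non-zero in both, then take the
    geometric mean of those medians.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[0]
    factors = np.full(n_samples, np.nan)
    for i in range(n_samples):
        medians = []
        for j in range(n_samples):
            if i == j:
                continue  # assumption: self-comparison is skipped
            shared = (counts[i] > 0) & (counts[j] > 0)
            if shared.any():
                medians.append(np.median(counts[i, shared] / counts[j, shared]))
        if medians:
            # Geometric mean of the pairwise median ratios
            factors[i] = np.exp(np.mean(np.log(medians)))
    return factors


def gmpr_normalize(counts):
    """Divide each sample (row) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / gmpr_size_factors(counts)[:, None]
```

Working only on OTUs shared by a pair of samples is what makes this kind of size factor usable on zero-inflated count tables.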

### Prediction of unknown sources proportion

The unknown sources proportion is the proportion of OTUs in the test sample which are not present in the training dataset.

Let $S$ be a sample of size $O$, i.e. with $O$ OTUs, from the test dataset $D_{test}$.
Let $n$ be the average number of samples per class in the training dataset.
Let $U_n$ be the set of $n$ samples $U_1, \dots, U_n$ added to the training dataset to account for the unknown source proportion in a test sample.

First, an $\alpha$ proportion (default: $0.1$) of each OTU $o_i$ (with $i\in[1,O]$) is added to the training dataset for each sample $U_j$ (with $j\in[1,n]$), such that $U_j(o_i) = \alpha\times S(o_i)$.

The $U_n$ samples are then merged as columns into the training dataset ($D_{train}$) to create a new training dataset, denoted $D_{train\ unknown}$.
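
The construction of the unknown samples can be sketched with NumPy as follows. This is only an illustration of the formula $U_j(o_i) = \alpha\times S(o_i)$: the function names `make_unknown_samples` and `add_unknown_to_training` and the `"unknown"` label are hypothetical, and the layout (OTUs as rows, samples as columns) is assumed.

```python
import numpy as np


def make_unknown_samples(test_sample, n, alpha=0.1):
    """Build n 'unknown' samples from one test sample (hypothetical helper).

    test_sample: 1D array-like of O OTU counts for the sample S.
    Returns an (O, n) array whose column j holds U_j(o_i) = alpha * S(o_i).
    """
    s = np.asarray(test_sample, dtype=float)
    return np.tile(alpha * s[:, None], (1, n))


def add_unknown_to_training(d_train, labels, test_sample, alpha=0.1):
    """Merge the unknown samples as extra columns of the training table.

    d_train: (O, n_train) array, OTUs as rows and samples as columns.
    labels: one class label per training column.
    """
    d_train = np.asarray(d_train, dtype=float)
    classes = set(labels)
    # n is the average number of samples per class, rounded to an integer
    n = max(1, int(round(len(labels) / len(classes))))
    unknown = make_unknown_samples(test_sample, n, alpha)
    d_train_unknown = np.hstack([d_train, unknown])
    return d_train_unknown, list(labels) + ["unknown"] * n
```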

To predict this unknown proportion, the dimension of the training dataset $D_{train\ unknown}$ (samples as columns, OTUs as rows) is first reduced to 20 using the scikit-learn [@scikit-learn] implementation of PCA.
This training dataset is further divided into three subsets: train (64%), test (20%), and validation (16%).
The scikit-learn implementation of the K-Nearest-Neighbors (KNN) algorithm is then trained on the train subset, and the test accuracy is computed on the test subset.
The trained KNN model is then calibrated for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with the validation subset.
This procedure is repeated for each sample of the test dataset.
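
A rough scikit-learn sketch of this step is given below. It is an assumption-laden illustration, not the package's code: the helper names, `n_neighbors=5`, and the binary encoding of labels (1 for unknown, 0 for known) are all hypothetical, and Platt's scaling is approximated by hand with a logistic regression fitted on the KNN scores of the validation subset rather than with scikit-learn's calibration utilities. The 64/20/16 split is obtained by first holding out 20% for testing, then 20% of the remaining 80% for validation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_unknown_classifier(X, y, n_components=20, n_neighbors=5, seed=0):
    """X: (samples, OTUs) array, i.e. the transpose of an OTU-by-sample table.
    y: 1 for 'unknown' samples, 0 for samples of known source (assumed encoding).
    """
    pca = PCA(n_components=n_components, random_state=seed)
    X_red = pca.fit_transform(X)

    # 20% test, then 20% of the remaining 80% (16% overall) for validation
    X_rest, X_test, y_rest, y_test = train_test_split(
        X_red, y, test_size=0.20, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=seed)

    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    test_accuracy = knn.score(X_test, y_test)

    # Platt-style calibration: sigmoid fitted on the KNN score for class 1
    val_scores = knn.predict_proba(X_val)[:, [1]]
    platt = LogisticRegression().fit(val_scores, y_val)
    return pca, knn, platt, test_accuracy


def predict_unknown_proportion(pca, knn, platt, x):
    """x: (1, OTUs) array for a single test sample."""
    score = knn.predict_proba(pca.transform(x))[:, [1]]
    return float(platt.predict_proba(score)[0, 1])
```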

### Prediction of known source proportion

First, only OTUs corresponding to the *species* taxonomic level are kept using the ETE toolkit [@ete3].
A distance matrix is then computed on the merged training dataset $D_{train}$ and test dataset $D_{test}$ using the scikit-bio implementation of the weighted UniFrac distance (default) [@wu].

The distance matrix is then embedded in two dimensions using the scikit-learn implementation of t-SNE [@tsne].

The 2-dimensional embedding is then split back into training and test datasets.
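
The embedding and the split back into training and test sets can be sketched as follows, assuming the distance matrix has training samples first and test samples last. The function name `embed_and_split` and the perplexity heuristic are hypothetical choices for illustration; any square, non-negative distance matrix can be passed in place of weighted UniFrac.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_and_split(dist, n_train, seed=0):
    """Embed a precomputed distance matrix in 2D and split the rows.

    dist: (n, n) distance matrix, training samples first, test samples last.
    n_train: number of training samples.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    tsne = TSNE(
        n_components=2,
        metric="precomputed",   # dist is already a distance matrix
        init="random",          # PCA init is not available for precomputed input
        perplexity=min(30.0, (n - 1) / 3.0),  # must stay below n (assumed heuristic)
        random_state=seed,
    )
    embedding = tsne.fit_transform(dist)
    return embedding[:n_train], embedding[n_train:]
```

Embedding the merged datasets together matters because t-SNE has no transform for unseen points: test samples must be present when the embedding is computed.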

The training dataset is further divided into three subsets: train (64%), test (20%), and validation (16%).
The scikit-learn implementation of the K-Nearest-Neighbors (KNN) algorithm is then trained on the train subset, and the test accuracy is computed on the test subset.
The trained KNN model is then calibrated for source proportion estimation using the scikit-learn implementation of Platt's scaling method with the validation subset.

### Combining unknown and source proportion

For each sample, the predicted unknown proportion $p_{unknown}$ is then combined with the predicted proportion of each of the $C$ source classes $c$ of the training dataset such that:

$$\sum_{c=1}^{C} s_c + p_{unknown} = 1$$

with

$$s_c = s_{c\ predicted}\times (1 - p_{unknown})$$
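
As a small worked sketch of the combination step: assuming the predicted source proportions sum to one, the constraint $\sum_{c=1}^{C} s_c + p_{unknown} = 1$ is satisfied when each predicted proportion is rescaled by the known fraction $1 - p_{unknown}$. The function name `combine_proportions` is hypothetical, and the renormalization of the inputs is a defensive assumption.

```python
import numpy as np


def combine_proportions(predicted_sources, p_unknown):
    """Rescale predicted source proportions by the known fraction.

    predicted_sources: 1D array-like of C predicted source proportions.
    p_unknown: predicted unknown proportion, between 0 and 1.
    Returns s such that s.sum() + p_unknown == 1 (up to rounding).
    """
    predicted = np.asarray(predicted_sources, dtype=float)
    predicted = predicted / predicted.sum()  # assumption: renormalize defensively
    return predicted * (1.0 - p_unknown)
```

For example, predicted proportions of 0.5, 0.3 and 0.2 with an unknown proportion of 0.2 become 0.4, 0.24 and 0.16.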

# References
