
Commit

paper update
maxibor committed May 17, 2019
1 parent 6a47e5f commit 5e95dbb
Showing 3 changed files with 7 additions and 13 deletions.
2 changes: 1 addition & 1 deletion paper/codemeta.json
@@ -7,7 +7,7 @@
"datePublished": "2019-05-02",
"dateModified": "2019-05-02",
"dateCreated": "2019-05-02",
"description": "Prediction/source tracking of metagenomic samples source using machine learning",
"description": "Sourcepredict: Prediction/source tracking of metagenomic sample sources using machine learning",
"keywords": "microbiome, sourcetracking, machine learning",
"license": "GPL v3.0",
"title": "sourcepredict",
18 changes: 6 additions & 12 deletions paper/paper.md
@@ -17,15 +17,15 @@ bibliography: paper.bib

# Summary

-SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python package to classify and predict the source of metagenomics sample given a training set.
+SourcePredict [(github.com/maxibor/sourcepredict)](https://github.com/maxibor/sourcepredict) is a Python Conda package to classify and predict the source of metagenomic samples given a reference dataset of known sources.

DNA shotgun sequencing of human, animal, and environmental samples opened up new doors to explore the diversity of life in these different environments, a field known as metagenomics [@metagenomics].
One aspect of metagenomics is to look at the organism composition of a sequencing sample, with tools known as taxonomic classifiers.
These taxonomic classifiers, such as Kraken [@kraken], compute the organism-level taxonomic composition of a sample from its DNA sequencing data.

While in most cases the origin (source) of a metagenomic sample is known, it is sometimes part of the research question to infer and/or confirm its source.
-Using samples of known sources, a reference dataset can be established with the samples organism taxonomic composition as features, and the source of the sample as class labels.
-With this reference dataset, a machine learning algorithm can be trained to predict the source of unlabeled samples from their organism taxonomic composition.
+Using samples of known sources, a reference dataset can be established with the samples' taxonomic composition (the organisms identified in the sample) as features, and the source of the sample as class labels.
+With this reference dataset, a machine learning algorithm can be trained to predict the source of unlabeled samples from their taxonomic composition.
Compared to SourceTracker [@sourcetracker], which uses Gibbs sampling, Sourcepredict uses dimension reduction algorithms, followed by K-Nearest-Neighbors (KNN) classification.

Here, I present SourcePredict for the classification/prediction of unlabeled sample sources from their taxonomic compositions.
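A minimal sketch of this reference-dataset idea with scikit-learn, assuming hypothetical CSV files (`reference_otu_table.csv`, `reference_labels.csv`, `test_otu_table.csv`, with organisms as rows and samples as columns, and a `labels` column for the sources) and skipping the dimension-reduction step described below:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical inputs: taxonomic count tables with organisms as rows and
# samples as columns, plus a table mapping each reference sample to its source.
ref_counts = pd.read_csv("reference_otu_table.csv", index_col=0)
labels = pd.read_csv("reference_labels.csv", index_col=0)["labels"]

# Samples as rows, organisms (features) as columns, aligned with class labels
X = ref_counts.T
y = labels.loc[X.index]

# Train a classifier on the labeled reference samples
knn = KNeighborsClassifier().fit(X, y)

# Predict the source of unlabeled (sink) samples from their taxonomic composition
sink = pd.read_csv("test_otu_table.csv", index_col=0).T
print(knn.predict(sink.reindex(columns=X.columns, fill_value=0)))
```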
@@ -43,12 +43,12 @@ Let $S$ be a sample of $O$ organisms from the test dataset $D_{sink}$
Let $n$ be the average number of samples per class in the reference dataset.
I define $U_n$ samples to add to the training dataset to account for the unknown source proportion in a test sample.

-To compute $U_n$, a $\alpha$ proportion (default = $0.1$) of each $o_i$ organism (with $i\in[1,O]$) is added to the training dataset for each $U_j$ samples (with $j\in[1,n]$), such as $U_j(o_i) = \alpha\times S_(o_i)$
+To compute $U_n$, an $\alpha$ proportion (default = $0.1$) of each organism $o_i$ (with $i\in[1,O]$) is added to the training dataset for each of the $U_j$ samples (with $j\in[1,n]$), such that $U_j(o_i) = \alpha\times S(o_i)$

The $U_n$ samples are then merged as columns to the reference dataset ($D_{ref}$) to create a new reference dataset denoted $D_{ref\ unknown}$
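A short sketch of this augmentation step, assuming pandas inputs: a hypothetical `d_ref` count table (organisms as rows, samples as columns), a `labels` Series giving the source of each reference sample, and a `sink` Series of organism counts for one test sample (all names are assumptions for illustration):

```python
import pandas as pd

def add_unknown_samples(d_ref, labels, sink, alpha=0.1):
    """Return D_ref_unknown: d_ref with n extra 'unknown' samples appended."""
    # n = average number of samples per class in the reference dataset
    n = int(round(labels.value_counts().mean()))
    # Each U_j gets alpha * S(o_i) for every organism o_i of the sink sample S
    unknowns = pd.DataFrame({f"UNKNOWN_{j}": alpha * sink for j in range(1, n + 1)})
    # Merge the U_n samples as additional columns of the reference dataset
    return pd.concat([d_ref, unknowns], axis=1).fillna(0)
```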

To predict this unknown proportion, the dimension of the reference dataset $D_{ref\ unknown}$ (samples in columns, organisms as rows) is first reduced to 20 with the scikit-learn [@scikit-learn] implementation of PCA.
-This reference dataset is then divided into three subsets: $D_{train\ unknown}$ (64%), $D_{test\ unknown}$ (20%), and $D_{validation unknown}$(16%).
+This reference dataset is then divided into three subsets: $D_{train\ unknown}$ (64%), $D_{test\ unknown}$ (20%), and $D_{validation\ unknown}$ (16%).

The scikit-learn implementation of the KNN algorithm is then trained on $D_{train\ unknown}$, and the test accuracy is computed with $D_{test\ unknown}$.
The trained KNN model is then corrected for probability estimation of the unknown proportion using the scikit-learn implementation of Platt's scaling method [@platt] with $D_{validation\ unknown}$.
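A condensed sketch of this PCA, KNN, and calibration chain with scikit-learn, run on randomly generated placeholder counts so it stays self-contained (the class names and sample sizes are assumptions for illustration); `cv="prefit"` is scikit-learn's way of calibrating an already-fitted model:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV

# Placeholder D_ref_unknown: 200 samples x 500 organisms with two known
# sources plus the added "unknown" class (random counts, purely illustrative)
rng = np.random.RandomState(42)
X = rng.poisson(5, size=(200, 500)).astype(float)
y = np.repeat(["soil", "gut", "unknown"], [80, 80, 40])

# Reduce the reference dataset to 20 dimensions with PCA
X_low = PCA(n_components=20).fit_transform(X)

# Split into D_train_unknown (64%), D_test_unknown (20%), D_validation_unknown (16%)
X_rest, X_test, y_rest, y_test = train_test_split(X_low, y, test_size=0.20, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.20, stratify=y_rest)

# Train KNN on D_train_unknown and compute the test accuracy on D_test_unknown
knn = KNeighborsClassifier().fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))

# Platt's scaling of the trained model on D_validation_unknown
calibrated = CalibratedClassifierCV(knn, method="sigmoid", cv="prefit").fit(X_val, y_val)
print(calibrated.predict_proba(X_test[:2]))  # calibrated probabilities, incl. "unknown"
```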
@@ -81,13 +81,7 @@ with

$$s_c = p_{c}\times p_{unknown}$$
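A purely numerical instance of this relation, with assumed illustrative values $p_{c} = 0.7$ and $p_{unknown} = 0.2$ (the definitions of $p_{c}$ and $p_{unknown}$ come from the preceding steps):

$$s_c = p_{c}\times p_{unknown} = 0.7 \times 0.2 = 0.14$$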

-## Command Line Interface

-The SourcePredict CLI is handled with argparse. A typical command to use SourcePredict is as simple as:

-`sourcepredict path/to/test_otu_table.csv`

-The documentation of CLI is available at [sourcepredict.readthedocs.io](https://sourcepredict.readthedocs.io)
+Finally, a summary table is created to gather the estimated source proportions.

## Acknowledgements

Binary file modified paper/paper.pdf
Binary file not shown.
