Prediction/source tracking of metagenomic sample sources using machine learning


Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. Sourcepredict solves this problem by applying machine learning classification to dimensionally reduced datasets.
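To give a feel for the idea, here is a minimal, standard-library-only sketch of source tracking with a Bray-Curtis dissimilarity and a k-nearest-neighbour vote. It is not Sourcepredict's implementation: it skips normalization and the embedding step and votes directly in distance space, and the function names and toy counts are hypothetical.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def knn_source_proportions(sink, sources, labels, k=3):
    """Share of each label among the k reference samples closest to the sink."""
    ranked = sorted(
        zip(sources, labels),
        key=lambda pair: bray_curtis(sink, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}


# Toy taxon counts (hypothetical data): one row per sample, one column per taxon.
sources = [
    [90, 10, 0, 0],
    [80, 15, 5, 0],
    [0, 5, 70, 25],
    [0, 0, 60, 40],
]
labels = ["dog", "dog", "soil", "soil"]
sink = [85, 10, 5, 0]

proportions = knn_source_proportions(sink, sources, labels, k=3)
print(proportions)
```

With these toy counts, two of the three nearest references are labelled "dog", so the sink is attributed mostly to that source.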


With conda (recommended)

$ conda install -c conda-forge -c maxibor sourcepredict

With pip

$ pip install sourcepredict




$ wget -O dog_example.csv
$ wget -O sp_labels.csv
$ wget -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
  == Sample: ERR1915662 ==
	Adding unknown
	Normalizing (GMPR)
	Computing Bray-Curtis distance
	Performing MDS embedding in 2 dimensions
	KNN machine learning
	Training KNN classifier on 2 cores...
	-> Testing Accuracy: 1.0
	- Sample: ERR1915662
Step 2: Checking for source proportion
	Computing weighted_unifrac distance on species rank
	TSNE embedding in 2 dimensions
	KNN machine learning
	Performing 5 fold cross validation on 2 cores...
	Trained KNN classifier with 10 neighbors
	-> Testing Accuracy: 0.99
	- Sample: ERR1915662
Sourcepredict result written to dog_test_sample.sourcepredict.csv


Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower dimensional space. See documentation for details.
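The contribution table can then be inspected from Python. The sketch below assumes a layout with one row per source and one column per sink sample. That layout, the helper name `read_contributions`, and the numeric values are assumptions for illustration only, so check them against a real result file and the documentation.

```python
import csv
import io


def read_contributions(handle):
    """Parse a source-by-sink table into {sink: {source: proportion}}."""
    reader = csv.reader(handle)
    header = next(reader)
    sinks = header[1:]  # first column is assumed to hold the source names
    result = {sink: {} for sink in sinks}
    for row in reader:
        if not row:
            continue
        source = row[0]
        for sink, value in zip(sinks, row[1:]):
            result[sink][source] = float(value)
    return result


# Made-up values in the assumed layout; replace with open("<result file>").
example = io.StringIO(
    ",ERR1915662\n"
    "Canis_familiaris,0.90\n"
    "Homo_sapiens,0.05\n"
    "Soil,0.03\n"
    "unknown,0.02\n"
)

contributions = read_contributions(example)
sample = contributions["ERR1915662"]
top_source = max(sample, key=sample.get)
print(top_source)
```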


Depending on the normalization method (-n), the embedding method (-me), the number of CPUs available for parallel processing (-t), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
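When scripting several runs with different settings, the command line can be assembled from Python. This is a hypothetical helper, not part of Sourcepredict. The flags come from the text above, while the value spellings ("GMPR", "MDS") are assumed from the example log and should be checked against the tool's help output.

```python
import shlex


def build_command(sink, sources, labels,
                  normalization=None, embedding=None, threads=None):
    """Build a sourcepredict argument list; options left as None are omitted."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]
    if embedding is not None:
        cmd += ["-me", embedding]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd.append(sink)  # the sink table is the positional argument
    return cmd


cmd = build_command(
    "dog_example.csv", "sp_sources.csv", "sp_labels.csv",
    normalization="GMPR", embedding="MDS", threads=2,
)
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Sourcepredict is installed.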


The documentation of Sourcepredict is available here:

Sourcepredict example files

Environments included in the example source file

  • Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
  • Canis familiaris gut microbiome (1)
  • Soil microbiome (1, 2, 3)

Contributing Code, Documentation, or Feedback

If you wish to contribute to Sourcepredict, you are welcome and encouraged to do so by opening an issue or creating a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.

How to cite

Sourcepredict has been published in JOSS.

	journal = {Journal of Open Source Software},
	doi = {10.21105/joss.01540},
	issn = {2475-9066},
	number = {41},
	publisher = {The Open Journal},
	title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
	url = {},
	volume = {4},
	author = {Borry, Maxime},
	pages = {1540},
	date = {2019-09-04},
	year = {2019},
	month = {9},
	day = {4}