Fungi Identify the Geographic Origin of Dust Samples
This repository provides R code to conduct spatial source prediction of dust samples relying solely on their dust-associated fungal communities. These methods mark a new approach to forensic biology that could be used by scientists to identify the origin of dust or soil samples found on objects, clothing, or archaeological artifacts.
For more information, please see our associated publication:
- Fork or clone this repository onto your computer.
- Open R and set the working directory to this directory (e.g., with setwd()).
- Run get-data.R to download the data from the 1000homes figshare repository (thanks Albert!) and munge it into csv files.
Note: You only need to run get-data.R once to download the files locally. This fills a subdirectory with S.csv (lon, lat coordinates of each home), X.csv (covariate info for each home), and Y.csv (presence/absence of taxa per home). A further subdirectory, raw, is created that holds the pre-munged fa data files.
To set up your working environment, run set-workspace.R. This file loads pertinent R packages, sources user-defined functions in functions.R, and loads and (slightly) reformats the csv data files.
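The loading step can be sketched roughly as follows. This is an illustration, not the contents of set-workspace.R: to stay self-contained it writes three tiny made-up files to a temporary directory and reads them back with base R. The directory, column names, and values are all hypothetical.

```r
# Hypothetical stand-in for the data directory; the real files come from get-data.R.
data_dir <- file.path(tempdir(), "toy-data")
dir.create(data_dir, showWarnings = FALSE)

# Made-up toy data: 3 homes, 1 covariate, 2 taxa.
write.csv(data.frame(lon = c(-78.6, -105.3, -122.4), lat = c(35.8, 40.0, 37.8)),
          file.path(data_dir, "S.csv"), row.names = FALSE)
write.csv(data.frame(elevation = c(96, 1655, 16)),
          file.path(data_dir, "X.csv"), row.names = FALSE)
write.csv(data.frame(taxon1 = c(1, 0, 1), taxon2 = c(0, 0, 1)),
          file.path(data_dir, "Y.csv"), row.names = FALSE)

# Read the files back; coordinates and occurrences are coerced to matrices.
S <- as.matrix(read.csv(file.path(data_dir, "S.csv")))
X <- read.csv(file.path(data_dir, "X.csv"))
Y <- as.matrix(read.csv(file.path(data_dir, "Y.csv")))

# Every object should have one row per home.
stopifnot(nrow(S) == nrow(X), nrow(S) == nrow(Y))
```

Checking that the three objects agree on the number of homes is a cheap guard against a partial download.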
Plot estimated fungi occurrence probabilities
Produce taxon-specific "hot spot" maps via kernel smoothing using
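As a rough illustration of how kernel smoothing can turn presence/absence data into a map of occurrence probabilities, here is a minimal Nadaraya-Watson sketch. The Gaussian kernel, the Euclidean distance, the bandwidth h, the function name smooth_occurrence, and the simulated data are all assumptions made for this example; the repository's own smoother may differ.

```r
# Nadaraya-Watson kernel smoother (assumed form: Gaussian kernel, Euclidean distance).
# S: n x 2 matrix of sample coordinates; Y: n x m 0/1 matrix of occurrences;
# Tgrid: g x 2 matrix of grid points; h: bandwidth in coordinate units (hypothetical).
smooth_occurrence <- function(S, Y, Tgrid, h) {
  # Squared distances between every grid point and every sample (g x n).
  d2 <- outer(Tgrid[, 1], S[, 1], "-")^2 + outer(Tgrid[, 2], S[, 2], "-")^2
  W <- exp(-d2 / (2 * h^2))   # Gaussian kernel weights
  W <- W / rowSums(W)         # normalize the weights at each grid point
  W %*% Y                     # g x m matrix of estimated occurrence probabilities
}

set.seed(1)
S <- cbind(lon = runif(50, 0, 10), lat = runif(50, 0, 10))  # made-up locations
Y <- cbind(taxonA = as.numeric(S[, "lon"] > 5),               # present only in the east
           taxonB = rbinom(50, 1, 0.3))                       # present at random
Tgrid <- as.matrix(expand.grid(lon = seq(0, 10, by = 1), lat = seq(0, 10, by = 1)))
M <- smooth_occurrence(S, Y, Tgrid, h = 1)
```

Each column of M can then be drawn over Tgrid as a "hot spot" map; in this toy data the taxonA column is high in the east and low in the west.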
Demonstrate the model
demonstrate-model.R showcases the statistical analysis using a small subset of the taxa occurrence data over a single fold of the cross-validation.
Note: The purpose of this file is to demonstrate the steps behind our predictions in a computationally feasible manner. Unsurprisingly, the predictions produced by operating on the full data in cross-validate.R are much better than those produced here.
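One plausible reading of the prediction step is sketched below. It assumes that taxa occur independently given location, that the occurrence probabilities are held in a grid-by-taxon matrix, and that the prior over grid points is uniform. The function name grid_pmf, the clipping constant eps, and the numbers are hypothetical and are not taken from demonstrate-model.R.

```r
# Assumed model: taxa occur independently, with probability M[t, j] at grid point t.
# M: g x m matrix of occurrence probabilities; y: length-m 0/1 vector for one sample.
grid_pmf <- function(M, y, eps = 1e-6) {
  P <- pmin(pmax(M, eps), 1 - eps)   # clip so that log() stays finite
  # Bernoulli log-likelihood of the sample at each grid point.
  llike <- as.vector(log(P) %*% y + log(1 - P) %*% (1 - y))
  w <- exp(llike - max(llike))       # subtract the max for numerical stability
  w / sum(w)                         # normalize (uniform prior over grid points assumed)
}

# Made-up example: 3 grid points, 4 taxa.
M <- rbind(c(0.9, 0.8, 0.1, 0.2),
           c(0.5, 0.5, 0.5, 0.5),
           c(0.1, 0.2, 0.9, 0.8))
y <- c(1, 1, 0, 0)
pmf <- grid_pmf(M, y)
best <- which.max(pmf)   # index of the most probable grid point
```

Working on the log scale matters with tens of thousands of taxa, where a product of probabilities would underflow to zero.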
Replicate full analysis
The full analysis is conducted by cross-validate.R. It is recommended that this file be run on a server with many cores available. Make sure to set the number of available cores, ncore. With the current size of the data (n = 1331 samples, m = 57304 fungi taxa), five-fold cross-validation across 10 cores required nearly 5 hours to complete. (Note: individual folds are not run in parallel; rather, the species are split into ncore groups to ensure that M, the kernel-smoothed matrix of estimated occurrence probabilities, and llike, the log-likelihood values, are not prohibitively large.)
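The idea of splitting the taxa into ncore groups can be sketched as follows. This is an assumption-laden illustration, not the code in cross-validate.R: it treats taxa as independent Bernoulli outcomes so that each group's log-likelihood contribution can be computed separately and then summed. The helper block_llike and the toy sizes are hypothetical.

```r
library(parallel)

set.seed(42)
g <- 5; m <- 10; ncore <- 4                   # toy sizes; ncore is a hypothetical setting
M <- matrix(runif(g * m, 0.05, 0.95), g, m)   # made-up occurrence probabilities
y <- rbinom(m, 1, 0.5)                        # made-up presence/absence for one sample

# Log-likelihood contribution of a block of taxa at every grid point
# (independent Bernoulli taxa are an assumption of this sketch).
block_llike <- function(cols) {
  P <- M[, cols, drop = FALSE]
  as.vector(log(P) %*% y[cols] + log(1 - P) %*% (1 - y[cols]))
}

groups <- splitIndices(m, ncore)              # ncore groups of column indices
# mc.cores = 1 keeps the sketch portable; raise it on a multi-core Unix server.
partial <- mclapply(groups, block_llike, mc.cores = 1)
llike <- Reduce(`+`, partial)                 # sum the per-group contributions
```

Because each worker only touches its own block of columns, no single process has to hold the full grid-by-taxon matrix in memory at once.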
This file produces Tgrid, a matrix of prediction points, and a list, results, of length nfold. Each element of results contains:
- pmf.test2, probability mass function values over Tgrid for the locations relegated to the test set,
- Stest2, the true origin of the samples in the test set, and
- Stest2.hat, the predicted geographic origin of the samples in the test set.

Once results.RData is produced, analyze-predictions.R loads and analyzes the predictions of the statistical model overall and across several covariates.
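A minimal sketch of how such output might be scored is given below. It assumes that the point prediction is the grid point with the largest pmf value and that error is measured as great-circle distance. The objects pmf, Stest, and Shat are made-up stand-ins for pmf.test2, Stest2, and Stest2.hat, and the orientation of pmf (samples by grid points) is likewise an assumption.

```r
# Great-circle (haversine) distance in km between lon/lat points given in degrees.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up stand-ins for one fold's output: 3 grid points, 2 test samples.
Tgrid <- cbind(lon = c(-120, -100, -80), lat = c(40, 40, 40))
pmf <- rbind(c(0.7, 0.2, 0.1),   # one row per test sample, one column per grid point
             c(0.1, 0.3, 0.6))
Stest <- cbind(lon = c(-120, -100), lat = c(40, 40))   # hypothetical true origins

# Assumed point prediction: the grid point with the largest pmf value.
Shat <- Tgrid[apply(pmf, 1, which.max), , drop = FALSE]
err_km <- haversine_km(Stest[, "lon"], Stest[, "lat"], Shat[, "lon"], Shat[, "lat"])
```

Summaries such as median(err_km), computed overall or within levels of a covariate, would then give the kind of comparison described above.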
Questions or comments?
We would love to hear from you. If you wish to speak about the motivation, scope, and direction of this project, consider contacting our corresponding author, Robert R. Dunn. For questions regarding the specifics of the code provided here, please contact Neal S. Grantham. If you would instead like to discuss the molecular sequencing methods and data provided at the 1000homes figshare repository, please contact Albert Barberán.