This notebook contains an example of using `redbiom` through its Python API to extract a subset of American Gut Project samples. These data are then loaded into QIIME 2 for a mini beta-diversity analysis using UniFrac. This assumes we're using a QIIME 2 2018.11 environment that additionally has `redbiom` 0.3.0 installed. The exact commands I ran to install it are:

```
$ conda install nltk
$ pip install https://github.com/biocore/redbiom/archive/0.3.0.zip
```

In [1]:
import redbiom.summarize
import redbiom.search
import redbiom.fetch
import qiime2
import pandas as pd
pd.options.display.max_colwidth = 1000

The first thing we're going to do is gather the `redbiom` contexts. A context is roughly a set of consistent technical parameters: for example, the specific sequenced gene, the variable region within the gene, the length of the read, and how the operational taxonomic units were assessed.

`redbiom` partitions data into contexts because these technical details can lead to massive technical bias. The intention is to facilitate comparing "apples" to "apples".

The context summarization returns a pandas `DataFrame`, so it should be pretty friendly to manipulate.

In [2]:
contexts = redbiom.summarize.contexts()

In [3]:
contexts.shape

(104, 4)

In [4]:
contexts.head(2)

Unnamed: 0,ContextName,SamplesWithData,FeaturesWithData,Description
0,Pick_closed-reference_OTUs-Greengenes-ls454-16S-v6-150nt-bd7d4d,114,3115,Qiita context
1,Pick_closed-reference_OTUs-Greengenes-flx-16S-v4-150nt-bd7d4d,116,4218,Qiita context


At the present time, determining the context to use is a bit manual and requires some string munging. Additional development is needed.

Let's take a look at the larger contexts.

In [5]:
contexts.sort_values('SamplesWithData', ascending=False).head()

Unnamed: 0,ContextName,SamplesWithData,FeaturesWithData,Description
38,Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1,129596,74983,Qiita context
6,Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-5c6506,128222,82492,Qiita context
84,Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-90nt-44feac,125354,73083,Qiita context
48,Deblur-NA-illumina-16S-v4-100nt-fbc5b2,123127,5587560,Qiita context
18,Deblur-NA-illumina-16S-v4-90nt-99d1d8,119538,4460311,Qiita context


For simplicity's sake, let's take the first context. It's large, and the phylogeny associated with the operational taxonomic units is easy to get. We'll break down the meaning of the context name in a moment. In practice, you will _most likely_ want to use the Deblur data; however, producing a reasonable tree from those data requires a slightly computationally expensive step, and I'm on my laptop right now with limited battery quite literally in the middle of nowhere on a bus in the Czech Republic.

## Using Deblur Context

In [6]:
ctx = contexts.sort_values('SamplesWithData', ascending=False).iloc[3]['ContextName']

In [7]:
ctx

'Deblur-NA-illumina-16S-v4-100nt-fbc5b2'
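
Picking the context by position (`iloc[3]`) depends on the current ordering of the summary, so it can silently change as contexts are added. A slightly more defensive sketch selects by name prefix instead. The small `DataFrame` below (`contexts_demo`, my own name) is a stand-in built from the summary rows shown above so the example runs offline; with a live connection you would filter `contexts` directly.

```python
import pandas as pd

# Stand-in for redbiom.summarize.contexts(), copied from the output above.
contexts_demo = pd.DataFrame({
    'ContextName': [
        'Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1',
        'Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-5c6506',
        'Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-90nt-44feac',
        'Deblur-NA-illumina-16S-v4-100nt-fbc5b2',
        'Deblur-NA-illumina-16S-v4-90nt-99d1d8',
    ],
    'SamplesWithData': [129596, 128222, 125354, 123127, 119538],
})

# Keep only the Deblur contexts, then take the one with the most samples.
deblur = contexts_demo[contexts_demo['ContextName'].str.startswith('Deblur')]
ctx_demo = deblur.sort_values('SamplesWithData', ascending=False).iloc[0]['ContextName']
print(ctx_demo)
```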

**Note for Daniel: the notes below talk about Greengenes, but we are using Deblur instead, as he recommends.**

Breaking this name into its constituent pieces, this is a closed-reference context, meaning that operational taxonomic units were assessed against a reference database, and sequences which did not recruit to the reference were discarded. The reference used is Greengenes, a common 16S reference database. The gene represented by the data is the 16S SSU rRNA gene, and specifically the V4 region of the gene. Finally, the fragments represented are truncated to 100 nucleotides. (Don't worry if this is all a lot of jargon. It is a lot of jargon. Please ask questions :)
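
The breakdown above can be sketched as a small parser. This is a rough sketch that assumes context names follow the hyphen-separated layout seen in the summaries earlier (method, reference, platform, gene, region, an optional trim length, and a short hash). The function `parse_context_name` and its field names are my own labels, not part of `redbiom`.

```python
import re

def parse_context_name(name):
    """Split a context name into labeled parts (field names are assumptions)."""
    parts = name.split('-')
    digest = parts.pop()  # trailing short hash
    # The trim length (e.g. '100nt') is missing from some context names.
    trim = parts.pop() if re.fullmatch(r'\d+nt', parts[-1]) else None
    region = parts.pop()
    gene = parts.pop()
    platform = parts.pop()
    reference = parts.pop()
    method = '-'.join(parts)  # the method itself may contain hyphens
    return {
        'method': method,
        'reference': reference,
        'platform': platform,
        'gene': gene,
        'region': region,
        'trim': trim,
        'hash': digest,
    }

closed_ref = parse_context_name(
    'Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-100nt-a243a1')
deblur_ctx = parse_context_name('Deblur-NA-illumina-16S-v4-100nt-fbc5b2')
untrimmed = parse_context_name(
    'Pick_closed-reference_OTUs-Greengenes-illumina-16S-v4-5c6506')
print(closed_ref)
```

Parsing from the right is deliberate: the method name `Pick_closed-reference_OTUs` contains a hyphen of its own, so splitting from the left would misalign the fields.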

So cool, we have a "context". What can we do now? Let's search for some sample identifiers based off of the metadata (i.e., variables) associated with the samples. Specifically, let's get some skin, oral and fecal samples. Be forewarned, the metadata search uses Python's `ast` module behind the scenes, so malformed queries at present produce tracebacks.
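
Since malformed queries produce tracebacks, it can help to build the query string in one place and sanity check it before sending it off. The sketch below is only a rough pre-check: it assumes the text after `where` should parse as a Python expression, which is my guess based on the `ast` remark above and not a statement about how `redbiom` validates queries. `build_where` and `looks_well_formed` are hypothetical helper names.

```python
import ast

def build_where(study_id, **equals):
    """Build a 'where' query with one equality test per keyword argument."""
    clauses = ["qiita_study_id==%d" % study_id]
    for column, value in equals.items():
        clauses.append("%s==%r" % (column, value))  # %r adds the quotes
    return "where " + " and ".join(clauses)

def looks_well_formed(query):
    """Rough pre-check: does the text after 'where' parse as an expression?"""
    prefix = "where "
    if not query.startswith(prefix):
        return False
    try:
        ast.parse(query[len(prefix):], mode="eval")
    except SyntaxError:
        return False
    return True

query_demo = build_where(10317, env_material='feces')
print(query_demo)
print(looks_well_formed(query_demo))
print(looks_well_formed("where qiita_study_id=="))
```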

In [8]:
study_id = 10317  # the Qiita study ID of the American Gut Project is 10317

query = "where qiita_study_id==%d" % (study_id)
results = redbiom.search.metadata_full(query)

In [9]:
len(results)

21506

In [10]:
study_id = 10317  # the Qiita study ID of the American Gut Project is 10317
results = {}
for site in ['sebum', 'saliva', 'feces']:
    query = "where qiita_study_id==%d and env_material=='%s'" % (study_id, site)
    results[site] = redbiom.search.metadata_full(query)

In [11]:
for k, v in results.items():
    print(k, len(v))

feces 16207
saliva 1257
sebum 1136


## Want to get metadata for all samples and then export to csv

In [12]:
to_keep_all = []
for k, v in results.items():
    to_keep_all.extend(list(v))

In [13]:
to_keep = to_keep_all[:5]
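
Slicing the first five entries is fine for a quick look, but if the search results are unordered collections (the `list(v)` call above suggests they might be, though I have not checked), the five samples picked can differ between runs. A minimal sketch of a reproducible subset, using hypothetical placeholder IDs in place of real search results:

```python
import random

def reproducible_subset(ids, n, seed=42):
    """Return up to n ids, chosen the same way on every run."""
    ordered = sorted(ids)      # fix the order first; sets have none
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(ordered, min(n, len(ordered)))

# Hypothetical IDs standing in for to_keep_all.
fake_ids = {'sample.%03d' % i for i in range(20)}
subset = reproducible_subset(fake_ids, 5)
print(subset)
```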

In [14]:
md_all, _ = redbiom.fetch.sample_metadata(to_keep, context=ctx)

In [15]:
md_all.shape

(5, 424)

In [16]:
md_all

Unnamed: 0,#SampleID,acid_reflux,acne_medication,acne_medication_otc,add_adhd,age_cat,age_corrected,age_years,alcohol_consumption,alcohol_types_beercider,...,vioscreen_wgrain,vioscreen_whole_grain_servings,vioscreen_xylitol,vioscreen_zinc,vitamin_b_supplement_frequency,vitamin_d_supplement_frequency,vivid_dreams,weight_change,weight_kg,weight_units
26674_10317.000015596,10317.000015596.26674,Not provided,False,False,Not provided,Not provided,Not provided,Not provided,False,False,...,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Remained stable,77.0,kilograms
26924_10317.000013965,10317.000013965.26924,Not provided,False,False,Not provided,30s,39.0,39.0,True,False,...,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Remained stable,58.0,kilograms
27978_10317.000007729,10317.000007729.27978,Not provided,False,False,Not provided,40s,40.0,40.0,True,False,...,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Not provided,Remained stable,77.0,kilograms
28264_10317.000029493,10317.000029493.28264,Not provided,False,False,I do not have this condition,60s,66.0,66.0,True,False,...,Not provided,Not provided,Not provided,Not provided,Never,Never,Not provided,Increased more than 10 pounds,76.0,kilograms
33048_10317.000041786,10317.000041786.33048,I do not have this condition,False,False,I do not have this condition,30s,34.0,34,True,True,...,Not provided,Not provided,Not provided,Not provided,Never,Never,Never,Remained stable,74.0,kilograms
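
The section heading mentions exporting the metadata to CSV, which the cells above stop short of. Here is a minimal sketch using pandas' `DataFrame.to_csv`. The two-row frame and the file name `agp_metadata.csv` are placeholders of mine; in the notebook you would call `to_csv` on `md_all` instead.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for md_all, with two columns taken from the output above.
md_demo = pd.DataFrame(
    {'age_cat': ['30s', '40s'], 'weight_kg': [58.0, 77.0]},
    index=pd.Index(['10317.000013965.26924', '10317.000007729.27978'],
                   name='#SampleID'),
)

# Write to a temporary directory so the sketch does not clutter the notebook folder.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'agp_metadata.csv')
md_demo.to_csv(out_path)

# Read it back to confirm the sample IDs survive the round trip.
round_trip = pd.read_csv(out_path, index_col='#SampleID')
print(round_trip)
```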


The last output cell shows what these IDs look like. These are Qiita sample IDs.

Now that we have some samples, let's get some data! What we're going to do is ask `redbiom` to obtain the sample data, for our `to_keep` samples, from the context we previously selected. What's happening behind the scenes is that the API is pulling out sparse vectors corresponding to the number of individual sequences observed for each operational taxonomic unit per sample, and additionally unmunging the names (as `redbiom` normalizes sample and feature identifiers). The output is then aggregated into what's called a BIOM `Table`, which is really just a rich object wrapped around a `scipy.sparse` matrix. 
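
To make the "sparse" part concrete, here is a toy sketch in plain Python. It is not how BIOM stores data internally (that is `scipy.sparse`, as noted above); it only shows the idea of keeping just the nonzero counts. The numbers are the first five rows of the table printed later in this notebook, with the sequences replaced by row indices.

```python
# Rows are features, columns are samples.
dense = [
    [193.0, 79.0, 162.0, 250.0, 24.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]

# Sparse form: store only the nonzero entries, keyed by (row, column).
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

# Density is the fraction of cells that actually hold a count.
n_cells = len(dense) * len(dense[0])
density = len(sparse) / n_cells
print(len(sparse), n_cells, density)
```

Real tables are far sparser than this toy slice, which is why the sparse representation pays off.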

You may notice two outputs on the return. The one we're ignoring represents "ambiguous" samples. Some sample identifiers are associated with multiple sequenced samples. This is because some samples may "fail" sequencing, where they didn't yield sufficient sequence data, and were rerun. These "failures" are still represented in Qiita, but are differentiated by the actual sequencing run they were on. This doesn't matter for the moment though.
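
If you ever do need to spot such reruns yourself, the idea can be sketched by grouping identifiers. This sketch assumes the `<tag>_<sample id>` layout seen in the metadata index earlier (for example `26674_10317.000015596`), and that the tag distinguishes the preparation or run; I have not verified that against the `redbiom` internals. The helper `find_ambiguous` is hypothetical, and the tag `99999` in the example is made up.

```python
from collections import defaultdict

def find_ambiguous(tagged_ids):
    """Map each sample id to its tags, keeping only ids seen more than once."""
    by_sample = defaultdict(list)
    for tagged in tagged_ids:
        tag, _, sample_id = tagged.partition('_')
        by_sample[sample_id].append(tag)
    return {s: tags for s, tags in by_sample.items() if len(tags) > 1}

example_ids = [
    '26674_10317.000015596',
    '99999_10317.000015596',  # made-up rerun of the same sample
    '26924_10317.000013965',
]
ambiguous = find_ambiguous(example_ids)
print(ambiguous)
```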

## Get the biom_table (OTU data) from deblur context

The following takes a while to run (~30 minutes).

In [18]:
biom_table, _ = redbiom.fetch.data_from_samples(ctx, to_keep)

In [19]:
biom_table

590 x 5 <class 'biom.table.Table'> with 850 nonzero entries (28% dense)

In [20]:
print(biom_table.head(5))

# Constructed from biom file
#OTU ID	10317.000029493.28264	10317.000007729.27978	10317.000015596.26674	10317.000013965.26924	10317.000041786.33048
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTTTTAAGTCAGCGGTGAAAGTCTGTGGCTCAACCATAGAATTG	193.0	79.0	162.0	250.0	24.0
AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGACCGGCAAGTTGGAAGTGAAAACTATGGGCTCAACCTGGGAACTG	0.0	2.0	0.0	0.0	0.0
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGAGCGTAGACGGCAAGGCAAGTCTGAAGTGGAAGCCCGGTGCTTAACGCCGGGACTGC	0.0	0.0	0.0	2.0	0.0
TACGTAGGGAGCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGAGCGTAGACGGCATCACAAGTCAGAAGTGAAAATCCGGGGCTCAACCCCGGAACTG	0.0	2.0	0.0	0.0	0.0
TACGTAGGGGGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTGCGTAGGCGGTTAATTAAGTTGGATGTGAAATTCCCGGGCTTAACTTGGGAGCTG	1.0	0.0	0.0	0.0	0.0


In [21]:
biom_table.ids(axis='observation')[:2]

array([ 'TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTTTTAAGTCAGCGGTGAAAGTCTGTGGCTCAACCATAGAATTG',
       'AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGACCGGCAAGTTGGAAGTGAAAACTATGGGCTCAACCTGGGAACTG'], dtype=object)

**ALPHA DIVERSITY OTU COUNT EXAMPLE**

In [22]:
from qiime2.plugins import feature_table, diversity, emperor

In [23]:
table_ar = qiime2.Artifact.import_data('FeatureTable[Frequency]', biom_table)

In [24]:
alpha = diversity.actions.alpha(table_ar, 'observed_otus')

In [30]:
diversity.actions.alpha?

In [39]:
alpha.alpha_diversity.export_data('alpha_dir')

In [40]:
alpha = pd.read_csv('alpha_dir/alpha-diversity.tsv', delimiter='\t')

In [41]:
alpha

Unnamed: 0.1,Unnamed: 0,observed_otus
0,10317.000029493.28264,178
1,10317.000007729.27978,218
2,10317.000015596.26674,150
3,10317.000013965.26924,176
4,10317.000041786.33048,128
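
As far as I understand it, `observed_otus` is simply the number of features with a nonzero count in each sample. A toy sketch of that calculation follows, using only the first five rows printed by `biom_table.head(5)` earlier, so the toy counts are much smaller than the real ones. As a sanity check on the real output, the five values above sum to 850, which matches the number of nonzero entries reported for `biom_table`.

```python
sample_ids = [
    '10317.000029493.28264',
    '10317.000007729.27978',
    '10317.000015596.26674',
    '10317.000013965.26924',
    '10317.000041786.33048',
]

# First five feature rows from the table head shown earlier (columns = samples).
counts = [
    [193.0, 79.0, 162.0, 250.0, 24.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]

# Observed features per sample: count the rows with a nonzero entry in that column.
observed = {
    sample: sum(1 for row in counts if row[j] > 0)
    for j, sample in enumerate(sample_ids)
}
print(observed)

# The real per-sample values from the output above add up to the table's nonzero count.
real_total = sum([178, 218, 150, 176, 128])
print(real_total)
```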
