This notebook contains an example of using `redbiom` through its Python API to extract a subset of American Gut Project samples. These data are then loaded into QIIME 2 for a mini beta-diversity analysis using UniFrac. This assumes a QIIME 2 2018.11 environment that additionally has `redbiom` 0.3.0 installed. The exact commands I ran to install it are:

```
$ conda install nltk
$ pip install https://github.com/biocore/redbiom/archive/0.3.0.zip
```

In [55]:
import redbiom.summarize
import redbiom.search
import redbiom.fetch
import qiime2
from qiime2.plugins import feature_table
import pandas as pd

The first thing we're going to do is gather the `redbiom` contexts. A context is roughly a set of consistent technical parameters: for example, the specific sequenced gene, the variable region within the gene, the length of the read, and how the operational taxonomic units were assessed.

`redbiom` partitions data into contexts because these technical details can introduce substantial technical bias. The intention is to facilitate comparing "apples" to "apples".

The context summarization returns a pandas `DataFrame` so it should be pretty friendly to manipulate.

In [2]:
contexts = redbiom.summarize.contexts()

In [3]:
contexts.head()

| | ContextName | SamplesWithData | FeaturesWithData | Description |
|---|---|---|---|---|
| 0 | Pick_closed-reference_OTUs-Greengenes-Illumina... | 206 | 309 | Qiita context |
| 1 | Deblur-Illumina-16S-V1-2-150nt-780653 | 27 | 1613 | Deblur (Reference phylogeny for SEPP: Greengen... |
| 2 | Pick_closed-reference_OTUs-Greengenes-Titanium... | 215 | 4811 | Qiita context |
| 3 | Pick_closed-reference_OTUs-Greengenes-Titanium... | 976 | 11836 | Qiita context |
| 4 | Pick_closed-reference_OTUs-Greengenes-FASTA-16... | 18 | 1706 | Qiita context |


At present, choosing a context is a somewhat manual process that requires some string munging. Additional development is needed.
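
The string munging can be done with ordinary pandas operations on the summary. Below is a minimal sketch on a hand-built stand-in for the `contexts` frame (the rows are copied from the outputs in this notebook). `pick_context` is a hypothetical helper, not part of `redbiom`.

```python
import pandas as pd

# Stand-in for redbiom.summarize.contexts(); rows copied from this notebook's output.
toy_contexts = pd.DataFrame({
    'ContextName': [
        'Deblur-Illumina-16S-V1-2-150nt-780653',
        'Deblur-Illumina-16S-V4-100nt-fbc5b2',
        'Deblur-Illumina-16S-V4-90nt-99d1d8',
    ],
    'SamplesWithData': [27, 141809, 137678],
})


def pick_context(summary, *required):
    """Return the largest context whose name contains every required substring."""
    mask = pd.Series(True, index=summary.index)
    for token in required:
        # regex=False treats the token as a literal string
        mask &= summary['ContextName'].str.contains(token, regex=False)
    hits = summary[mask].sort_values('SamplesWithData', ascending=False)
    if hits.empty:
        raise ValueError('no context matches %r' % (required,))
    return hits.iloc[0]['ContextName']


chosen = pick_context(toy_contexts, 'Deblur', 'V4')
print(chosen)
```

Matching on substrings such as the method and variable region keeps the choice explicit, rather than relying on whichever context happens to be largest.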

Let's take a look at the larger contexts.

In [4]:
contexts.sort_values('SamplesWithData', ascending=False).head()

| | ContextName | SamplesWithData | FeaturesWithData | Description |
|---|---|---|---|---|
| 9 | Pick_closed-reference_OTUs-Greengenes-Illumina... | 156761 | 83309 | Qiita context |
| 97 | Pick_closed-reference_OTUs-Greengenes-Illumina... | 145669 | 74959 | Qiita context |
| 92 | Deblur-Illumina-16S-V4-100nt-fbc5b2 | 141809 | 5698310 | Deblur (Reference phylogeny for SEPP: Greengen... |
| 102 | Pick_closed-reference_OTUs-Greengenes-Illumina... | 141380 | 72999 | Qiita context |
| 75 | Deblur-Illumina-16S-V4-90nt-99d1d8 | 137678 | 4576516 | Deblur (Reference phylogeny for SEPP: Greengen... |


In [5]:
ctx = contexts.sort_values('SamplesWithData', ascending=False).iloc[0]['ContextName']

In [6]:
ctx

'Pick_closed-reference_OTUs-Greengenes-Illumina-16S-V4-5c6506'

In [7]:
study_id = 10317  # the Qiita study ID of the American Gut Project is 10317
results = {}
for site in ['sebum', 'saliva', 'feces']:
    query = "where qiita_study_id==%d and env_material=='%s'" % (study_id, site)
    results[site] = redbiom.search.metadata_full(query)

In [8]:
for k, v in results.items():
    print(k, len(v))

saliva 1406
sebum 1289
feces 18659


In [9]:
to_keep = []
for k, v in results.items():
    to_keep.extend(list(v))

In [10]:
to_keep[:5]

['10317.000030115',
 '10317.000017317',
 '10317.000028264',
 '10317.000009458',
 '10317.000027738']

The last output cell shows what these IDs look like. These are Qiita sample IDs.
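
Each ID starts with the Qiita study ID followed by a dot. If the same sample were to match more than one query, extending a list as above would keep the duplicate. The sketch below shows one way to flatten the per-site results without duplicates and sanity-check the prefix. The IDs are the ones printed above, but their assignment to sites is made up for illustration.

```python
# Toy stand-in for the `results` dict; site membership here is invented.
toy_results = {
    'saliva': {'10317.000030115', '10317.000017317'},
    'sebum': {'10317.000028264'},
    'feces': {'10317.000009458', '10317.000027738', '10317.000030115'},
}

# A set union drops any ID returned by more than one query.
unique_ids = sorted(set().union(*toy_results.values()))

# Every ID should begin with the study ID followed by a dot.
study_prefix = '10317.'
unexpected = [i for i in unique_ids if not i.startswith(study_prefix)]

print(len(unique_ids), unexpected)
```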

Now that we have some samples, let's get some data! What we're going to do is ask `redbiom` to obtain the sample data, for our `to_keep` samples, from the context we previously selected. What's happening behind the scenes is that the API is pulling out sparse vectors corresponding to the number of individual sequences observed for each operational taxonomic unit per sample, and additionally unmunging the names (as `redbiom` normalizes sample and feature identifiers). The output is then aggregated into what's called a BIOM `Table`, which is really just a rich object wrapped around a `scipy.sparse` matrix. 

You may notice two outputs on the return. The one we're ignoring represents "ambiguous" samples. Some sample identifiers are associated with multiple sequenced samples. This is because some samples may "fail" sequencing, meaning they didn't yield sufficient sequence data, and were rerun. These "failures" are still represented in Qiita, but are differentiated by the actual sequencing run they were on. This doesn't matter for the moment though.
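
To make the aggregation step described above concrete, here is a small sketch that assembles per-sample sparse vectors into a single `scipy.sparse` matrix with OTUs on rows and samples on columns. This is only an illustration of the idea, not how `redbiom` or `biom` implement it. The OTU IDs come from outputs in this notebook, while the sample names and counts are made up.

```python
from scipy import sparse

# Hypothetical per-sample sparse vectors: only observed OTUs are stored.
sample_vectors = {
    'sample_a': {'4391034': 12, '804828': 3},
    'sample_b': {'804828': 7},
    'sample_c': {'633535': 1, '4391034': 5},
}

toy_otu_ids = sorted({otu for vec in sample_vectors.values() for otu in vec})
toy_sample_ids = sorted(sample_vectors)
row_of = {otu: r for r, otu in enumerate(toy_otu_ids)}

# Collect (row, column, value) triplets for every observed count.
rows, cols, vals = [], [], []
for col, sample in enumerate(toy_sample_ids):
    for otu, count in sample_vectors[sample].items():
        rows.append(row_of[otu])
        cols.append(col)
        vals.append(count)

matrix = sparse.csr_matrix((vals, (rows, cols)),
                           shape=(len(toy_otu_ids), len(toy_sample_ids)),
                           dtype=float)

# Density is the fraction of cells that hold a stored value.
density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(matrix.shape, matrix.nnz, round(density, 2))
```

The same ratio of nonzero entries to total cells is what the "% dense" figure in a BIOM `Table` repr summarizes.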

In [11]:
biom_table, _ = redbiom.fetch.data_from_samples(ctx, to_keep)

In [12]:
biom_table

34114 x 21238 <class 'biom.table.Table'> with 9262753 nonzero entries (1% dense)

The `repr` output shows that we have roughly 34k OTUs (operational taxonomic units) and 21,238 samples. What gives? The metadata searches returned 21,354 sample IDs in total! Just because a sample has metadata does not mean it has sequence data. It is also possible that some of the samples haven't been run through the same bioinformatic processing (e.g., closed reference at 100nt).
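
One way to find the requested samples that came back without data is to compare the requested IDs with the sample IDs in the table. The sketch below assumes, based on the IDs printed by `print(biom_table.head())` in this notebook, that table IDs are the Qiita sample ID plus a trailing `.<number>` tag. `strip_tag` is a hypothetical helper and the inputs are toy values.

```python
def strip_tag(table_id):
    """Drop the final dot-separated field, e.g. '10317.000023681.3099' -> '10317.000023681'."""
    return table_id.rsplit('.', 1)[0]


# Toy inputs: the requested IDs and the IDs found in the fetched table.
requested = ['10317.000023681', '10317.000028593', '10317.000030115']
in_table = ['10317.000023681.3099', '10317.000028593.3022']

with_data = {strip_tag(i) for i in in_table}
missing = [i for i in requested if i not in with_data]
print(missing)
```

With the real objects, `requested` would be `to_keep` and `in_table` would be `biom_table.ids()`.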

More information on `biom` can be found [here](http://biom-format.org/). 

Let's play with the object for just a moment for familiarity.

In [13]:
biom_table.head()

5 x 5 <class 'biom.table.Table'> with 0 nonzero entries (0% dense)

In [14]:
print(biom_table.head())

# Constructed from biom file
#OTU ID	10317.000023681.3099	10317.000028593.3022	10317.000041303.26484	10317.000022953.3018	10317.000003894.2946
4391034	0.0	0.0	0.0	0.0	0.0
804828	0.0	0.0	0.0	0.0	0.0
633535	0.0	0.0	0.0	0.0	0.0
789411	0.0	0.0	0.0	0.0	0.0
4469492	0.0	0.0	0.0	0.0	0.0


Ah! UniFrac is a phylogenetic metric, so we need a tree! Since we're using Greengenes, we can just rely on the existing prebuilt tree from the reference. Let's get that.

In [15]:
!wget ftp://ftp.microbio.me/greengenes_release/gg_13_8_otus/

--2019-06-02 15:18:56--  ftp://ftp.microbio.me/greengenes_release/gg_13_8_otus/
           => '.listing'
Resolving ftp.microbio.me... 169.228.46.98
Connecting to ftp.microbio.me|169.228.46.98|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /greengenes_release/gg_13_8_otus ... done.
==> PASV ... done.    ==> LIST ... done.

.listing                [ <=>                ]     511  --.-KB/s    in 0s      

2019-06-02 15:18:56 (32.5 MB/s) - '.listing' saved [511]

Removed '.listing'.
Wrote HTML-ized index to 'index.html' [1018].


In [18]:
!wget ftp://ftp.microbio.me/greengenes_release/gg_13_8_otus/trees/97_otus.tree

--2019-06-02 15:19:49--  ftp://ftp.microbio.me/greengenes_release/gg_13_8_otus/trees/97_otus.tree
           => '97_otus.tree'
Resolving ftp.microbio.me... 169.228.46.98
Connecting to ftp.microbio.me|169.228.46.98|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /greengenes_release/gg_13_8_otus/trees ... done.
==> SIZE 97_otus.tree ... 2542738
==> PASV ... done.    ==> RETR 97_otus.tree ... done.
Length: 2542738 (2.4M) (unauthoritative)


2019-06-02 15:19:54 (2.99 MB/s) - '97_otus.tree' saved [2542738]



In [19]:
tree_ar = qiime2.Artifact.import_data('Phylogeny[Rooted]', '97_otus.tree')

## Export the full OTU table so we don't have to reload everything later

In [20]:
from biom import load_table

In [23]:
table_ar = qiime2.Artifact.import_data('FeatureTable[Frequency]', biom_table)

In [24]:
table_ar.export_data('full_otus_greengenes')

In [26]:
!ls full_otus_greengenes

feature-table.biom


In [27]:
btable = load_table('full_otus_greengenes/feature-table.biom')

In [28]:
btable

34114 x 21238 <class 'biom.table.Table'> with 9262753 nonzero entries (1% dense)

In [29]:
import skbio
import biom

__version__ = '1.0'


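def trim_seqs(seqs, seqlength=100):
    """
    Trims the sequences to a given length

    Parameters
    ----------
    seqs : generator of skbio.Sequence objects
    seqlength : int
        Length to trim each sequence to.

    Returns
    -------
    generator of skbio.Sequence objects
        trimmed sequences

    Raises
    ------
    ValueError
        If a sequence is shorter than `seqlength`.
    """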

    for seq in seqs:

        if len(seq) < seqlength:
            raise ValueError('sequence length is shorter than %d' % seqlength)

        yield seq[:seqlength]


def remove_seqs(table, seqs):
    """
    Removes the observations whose ID matches one of the given sequences

    Parameters
    ----------
    table : biom.Table
       Input biom table
    seqs : generator, skbio.Sequence
       Iterator of sequence objects to be removed from the biom table.

    Returns
    -------
    biom.Table
    """
    filter_seqs = {str(s) for s in seqs}

    def _keep(values, id_, metadata):
        # keep an observation only if its ID is not one of the sequences
        return id_ not in filter_seqs

    return table.filter(_keep, axis='observation', inplace=False)

In [34]:
table = load_table('full_otus_greengenes/feature-table.biom')
seqs_file = 'newbloom.all.fna'

In [35]:
table

34114 x 21238 <class 'biom.table.Table'> with 9262753 nonzero entries (1% dense)

In [36]:
seqs = skbio.read(seqs_file, format='fasta')

In [37]:
length = min(map(len, table.ids(axis='observation')))
seqs = trim_seqs(seqs, seqlength=length)

In [38]:
outtable = remove_seqs(table, seqs)

In [39]:
outtable

34114 x 21238 <class 'biom.table.Table'> with 9262753 nonzero entries (1% dense)
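
The shape is unchanged, so the filter removed nothing. A plausible explanation, inferred from the outputs in this notebook rather than stated anywhere in it, is that the feature IDs in this closed-reference context are Greengenes OTU IDs such as `4391034` rather than DNA sequences, so no trimmed sequence can match them. The sketch below reproduces the logic on plain strings. The sequence is a made-up stand-in for an entry in the FASTA file.

```python
# Feature IDs as printed earlier in this notebook (Greengenes OTU IDs).
feature_ids = ['4391034', '804828', '633535', '789411', '4469492']

# A made-up stand-in for one sequence from the FASTA file.
bloom_seqs = ['TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGG']

# Same logic as the cells above: trim to the shortest feature ID, then filter.
length = min(map(len, feature_ids))
trimmed = {s[:length] for s in bloom_seqs}
kept = [f for f in feature_ids if f not in trimmed]

print(length, trimmed, len(kept) == len(feature_ids))
```

Sequence-based filtering of this kind fits tables whose feature IDs are the sequences themselves, which the Deblur contexts listed earlier may provide.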

In [40]:
table_ar = qiime2.Artifact.import_data('FeatureTable[Frequency]', outtable)

## Rarefaction

In [45]:
sampling_depth=1000

In [47]:
# rarefy to 1000 sequences per sample (yes, it's arbitrary)
rare_ar, = feature_table.actions.rarefy(table=table_ar, sampling_depth=sampling_depth)

In [48]:
biom_table_rar = rare_ar.view(biom.Table)

In [49]:
biom_table_rar

19134 x 20304 <class 'biom.table.Table'> with 2519259 nonzero entries (0% dense)
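
Rarefaction subsamples every sample to the same number of reads without replacement, and samples with fewer reads than the sampling depth are discarded, which is consistent with the drop in sample count shown above. The following is a plain-Python sketch of the idea on toy counts, not QIIME 2's implementation. `rarefy_counts` and the sample and feature names are hypothetical.

```python
import random


def rarefy_counts(counts, depth, rng):
    """Subsample a {feature: count} dict to exactly `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads.
    """
    # Expand counts into one entry per read, then draw `depth` of them.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None
    rarefied = {}
    for feature in rng.sample(reads, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_samples = {
    'deep': {'otu_1': 900, 'otu_2': 600, 'otu_3': 5},
    'shallow': {'otu_1': 40, 'otu_2': 10},
}
rarefied = {name: rarefy_counts(c, 1000, rng) for name, c in toy_samples.items()}
kept = {name: r for name, r in rarefied.items() if r is not None}
print(sorted(kept), sum(kept['deep'].values()))
```

Rare features can end up with zero reads after subsampling, which is one reason the feature count can shrink along with the sample count.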

In [50]:
print(biom_table_rar.head(3))

# Constructed from biom file
#OTU ID	10317.000023681.3099	10317.000028593.3022	10317.000041303.26484	10317.000022953.3018	10317.000015004.2917
4391034	0.0	0.0	0.0	0.0	0.0
789411	0.0	0.0	0.0	0.0	0.0
4469492	0.0	0.0	0.0	0.0	0.0


In [69]:
rare_ar.export_data('rar1000_greengenes')

## Converting to sparse DataFrame format

In [51]:
c = biom_table_rar.matrix_data

In [52]:
c = c.transpose()

In [53]:
c

<20304x19134 sparse matrix of type '<class 'numpy.float64'>'
	with 2519259 stored elements in Compressed Sparse Column format>

In [56]:
cdf = pd.SparseDataFrame(c)

In [57]:
print(cdf.shape)

(20304, 19134)
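
`pd.SparseDataFrame` is available in the pandas version that ships with the QIIME 2 2018.11 environment assumed here, but it was deprecated in pandas 0.25 and removed in pandas 1.0. On newer pandas, `pd.DataFrame.sparse.from_spmatrix` is the replacement. A minimal sketch on a toy matrix standing in for `c`:

```python
import pandas as pd
from scipy import sparse

# Toy samples x features count matrix standing in for `c`.
toy_counts = sparse.csc_matrix([[0.0, 3.0, 0.0],
                                [1.0, 0.0, 0.0]])

toy_frame = pd.DataFrame.sparse.from_spmatrix(toy_counts)
shape_before = toy_frame.shape
density = toy_frame.sparse.density  # fraction of stored (non-fill) values

# A regular (dense) column can still be attached afterwards.
toy_frame['sample_name'] = ['sample_a', 'sample_b']
print(shape_before, toy_frame.shape)
```

The `.sparse` accessor is only available while every column is sparse, so the density is read before the dense `sample_name` column is added.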


In [58]:
cdf['sample_name'] = biom_table_rar.ids()

In [59]:
cdf.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19125,19126,19127,19128,19129,19130,19131,19132,19133,sample_name
0,,,,,,,,,,,...,,,,,,,,,,10317.000023681.3099
1,,,,,,,,,,,...,,,,,,,,,,10317.000028593.3022
2,,,,,,,,,,,...,,,,,,,,,,10317.000041303.26484


**write biom sparse dataframe to pickle file**

In [61]:
cdf.to_pickle('greengenes_all_body_4.10.rar1000.biom_data.pkl')

## Create a `dna_seq` to `lookup_id` table that matches the columns in `cdf`

In [62]:
otu_ids = biom_table_rar.ids(axis='observation')

In [63]:
otu_id_df = pd.DataFrame(otu_ids, columns=['dna_seq'])

In [64]:
otu_id_df['lookup_id'] = range(len(otu_id_df)) 

In [65]:
otu_id_df.head()

| | dna_seq | lookup_id |
|---|---|---|
| 0 | 4391034 | 0 |
| 1 | 789411 | 1 |
| 2 | 4469492 | 2 |
| 3 | 820091 | 3 |
| 4 | 104141 | 4 |
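
The lookup table exists so that the integer column labels of the sparse frame can be translated back to feature IDs. Note that in this context the `dna_seq` column holds Greengenes OTU IDs rather than sequences. Below is a small sketch of using it, with the first rows shown above and a made-up list of column positions.

```python
import pandas as pd

# First rows of the lookup table shown above.
toy_lookup = pd.DataFrame({
    'dna_seq': ['4391034', '789411', '4469492'],
    'lookup_id': [0, 1, 2],
})

# Map the integer column labels used in the sparse frame back to feature IDs.
id_for_column = toy_lookup.set_index('lookup_id')['dna_seq'].to_dict()

# Hypothetical nonzero columns found for one sample.
nonzero_columns = [2, 0]
features = [id_for_column[col] for col in nonzero_columns]
print(features)
```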


**write id lookup to csv file**

In [66]:
# write to a separate .csv name (chosen here) so the pickle written above is not overwritten
otu_id_df.to_csv('greengenes_all_body_4.10.rar1000.otu_id_lookup.csv', index=False)