**Author**: Justine Debelius<br>
**email**: jdebelius@ucsd.edu<br>
**environment**: qiime2-2017.4<br>
**Date**: 8 May 2017

This notebook will assemble a processing package for American Gut data. We'll build a set of directories which contain the following set of files:

**Metadata**<br>
The sample and prep metadata downloaded from Qiita, along with appended Vioscreen results.
Alpha diversity (PD whole tree, Shannon, and observed OTUs) has been added for the rarefaction depth and for every shallower depth.

**OTU table**<br>
A rarefied and an unrarefied biom table are provided. The rarefied table is denoted by the rarefaction depth. The unrarefied table is filtered to remove samples with fewer reads than the rarefaction depth (e.g., all samples with fewer than 1250 sequences are removed from the unrarefied 1250 biom table).

**Distance Matrices**<br>
The weighted, normalized-weighted, and unweighted UniFrac distances and the Bray-Curtis distance are provided.

**PICRUSt**<br>
PICRUSt predictions were generated by clustering the deblur sequences against the Greengenes 13_8 OTU database at 99% and then running PICRUSt on the resulting table. Tables were filtered to remove samples with fewer than 1250 sequences/sample before normalization for 16S copy number. Tables collapsed at L1, L2, and L3 are also included.

In [None]:
import os
import shutil

import biom
import h5py

import numpy as np
import pandas as pd

Set the depths and metrics

In [None]:
depths = ['1250', '2500', '5000', '10000', '50000']
alpha_metrics = ['observed_otus', 'faiths_pd', 'shannon']

Creates directories for the packaged data

In [None]:
for depth in depths:
    dir_path = './03.packaged/%s' % depth
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
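
On Python 3.2 and later, `os.makedirs` accepts `exist_ok=True`, which folds the existence check into the call itself. A minimal sketch of that variant follows; the temporary `root` directory is a stand-in for the notebook's working directory, used here only so the example runs anywhere.

```python
import os
import tempfile

depths = ['1250', '2500', '5000', '10000', '50000']

# A throwaway root stands in for the notebook's working directory.
root = tempfile.mkdtemp()

for depth in depths:
    dir_path = os.path.join(root, '03.packaged', depth)
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(dir_path, exist_ok=True)

# Running the loop a second time raises no error.
for depth in depths:
    os.makedirs(os.path.join(root, '03.packaged', depth), exist_ok=True)

created = sorted(os.listdir(os.path.join(root, '03.packaged')), key=int)
print(created)
```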

We'll load the unrarefied table because rarefaction strips the taxonomy; keeping a lookup from the unrarefied table lets us add the taxonomy back.

In [None]:
raw_t = biom.load_table('./03.packaged/1250/deblur_125nt_no_blooms.biom')
taxa_lookup = {id_: {'taxonomy': raw_t.metadata(id_, axis='observation')['taxonomy']}
               for id_ in raw_t.ids(axis='observation')}

Then, we'll walk through the tables, add taxonomy, and save them to the new directory.

In [None]:
for depth in depths:
    # Sets up the file paths
    old_table_fp = './02.build_package/02.rarefied/%s/deblur_125nt_0.qza' % depth
    new_table_fp = './03.packaged/%s/deblur_125nt_no_blooms_rare.biom' % (depth)
    temp_dir = './table_temp'
    
    # Extracts the feature table into a biom table
    !qiime tools export $old_table_fp --output-dir $temp_dir
    
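    # Loads the exported table and adds the taxonomy back
    table_ = biom.load_table(os.path.join(temp_dir, 'feature-table.biom'))
    table_.add_metadata(taxa_lookup, axis='observation')

    # Opens the output file for writing and saves the annotated table
    with biom.util.biom_open(new_table_fp, 'w') as f_:
        table_.to_hdf5(f_, 'rarefied with taxa')

    # The export directory still holds the exported table,
    # so it is removed together with its contents
    shutil.rmtree(temp_dir)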
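
A note on the cleanup step: `os.removedirs` only deletes empty directories and raises `OSError` if files remain, while `shutil.rmtree` deletes a directory together with its contents. The sketch below illustrates the difference; the placeholder file stands in for an exported table and its contents are made up.

```python
import os
import shutil
import tempfile

# Stand-in for the export directory; the file mimics an exported table.
temp_dir = tempfile.mkdtemp()
with open(os.path.join(temp_dir, 'feature-table.biom'), 'w') as f_:
    f_.write('placeholder')

# os.removedirs refuses to delete a directory that still holds files.
try:
    os.removedirs(temp_dir)
    removedirs_failed = False
except OSError:
    removedirs_failed = True

# shutil.rmtree deletes the directory and everything inside it.
shutil.rmtree(temp_dir)

print(removedirs_failed, os.path.exists(temp_dir))
```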

We next load metadata and combine it with the alpha diversity.

In [None]:
map_ = pd.read_csv('./02.build_package/01.input/fecal_map_1250.txt', sep='\t', dtype=str)
map_.set_index('#SampleID', inplace=True)

In [None]:
map_.drop(['index'], axis='columns', inplace=True)

For each sequencing depth, starting with the shallowest rarefaction depth, we'll append the alpha diversity to the mapping file and then filter it to retain only the samples that have alpha diversity values at that depth.

To do this, we need to export the alpha diversity artifacts, since the QIIME 2 Python API doesn't handle the American Gut barcodes well.

In [None]:
sample_ids = {}
for depth in depths:
    collated = []
    for metric in alpha_metrics:
        alpha_ = []
        for i in np.arange(0, 10):
            alpha_artifact = './02.build_package/03.alpha/%s/%s_r%i.qza' % (depth, metric, i)
            alpha_dir = './02.build_package/03.alpha/%s/%s/' % (depth, metric)
            
            !qiime tools export $alpha_artifact --output-dir $alpha_dir
            
            alpha_.append(pd.read_csv(os.path.join(alpha_dir, 'alpha-diversity.tsv'),
                                      sep='\t', dtype=str
                                     ).rename(columns={'Unnamed: 0': '#SampleID', 
                                                       metric: '%s_%s_%i' % (metric, depth, i)}
                                             ).set_index('#SampleID'))
        alpha_ = pd.concat(alpha_, axis=1).astype(float)
        collated.append(alpha_)
        map_.loc[alpha_.index, '%s_%s' % (metric, depth)] = alpha_.mean(1)
    collated = pd.concat(collated, axis=1)
    collated.to_csv('./03.packaged/%s/collated_alpha.txt' % depth, sep='\t', index_label='#SampleID')
    sample_ids[depth] = collated.index
    map_.dropna().to_csv('./03.packaged/%s/ag_map_with_alpha.txt' % depth,
                         sep='\t', index_label='#SampleID')
    with open('./03.packaged/%s/sample_id.txt' % depth, 'w') as f_:
        f_.write('\n'.join(collated.index))
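
The averaging and filtering logic can be seen on a toy example. The sample IDs, the values, the `age` column, and the use of three iterations instead of ten are all invented for illustration; only the column naming pattern follows the cell above. The sketch assigns the mean as a whole column, relying on pandas index alignment, which is a simplification of the `.loc` assignment used in the notebook.

```python
import pandas as pd

metric, depth = 'shannon', '1250'

# Hypothetical per-iteration results; sample 'C' fell below the depth
# and is therefore absent from the alpha diversity output.
iterations = [
    pd.Series({'A': 4.0, 'B': 2.0}),
    pd.Series({'A': 5.0, 'B': 3.0}),
    pd.Series({'A': 6.0, 'B': 4.0}),
]
alpha_ = pd.concat(
    [s.rename('%s_%s_%i' % (metric, depth, i)) for i, s in enumerate(iterations)],
    axis=1,
)

# Toy mapping file with one sample that has no alpha diversity.
map_ = pd.DataFrame({'age': ['30', '41', '25']}, index=['A', 'B', 'C'])
map_.index.name = '#SampleID'

# Column assignment aligns on the index, so 'C' receives NaN.
map_['%s_%s' % (metric, depth)] = alpha_.mean(axis=1)

# Dropping rows with missing values keeps only samples present at this depth.
filtered = map_.dropna()
print(filtered)
```

Note that `dropna()` with no arguments drops a row if any column is missing, so metadata columns with blanks also cause samples to be removed.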

With the sample IDs saved for each depth, we'll filter the unrarefied deblur biom tables and the PICRUSt tables.

In [None]:
picrust_path = './02.raw_tables/picrust/otu_table_no_blooms_125nt_with_tax_min1250_gg99_normed_pred%s.biom'
picrust_tables = {
#     'raw': biom.load_table(picrust_path % ''),
    '_L1': biom.load_table(picrust_path % '_L1'),
    '_L2': biom.load_table(picrust_path % '_L2'),
    '_L3': biom.load_table(picrust_path % '_L3'),
    }

In [None]:
picrust_out = './03.packaged/%s/picrust/deblur_no_blooms_125nt_min1250_gg99_normed_pred%s.biom'

for depth in ['1250', '2500', '5000', '10000', '50000']:
    sample_path = './03.packaged/%s/sample_id.txt' % depth
    # Filters the unrarefied biom table to the appropriate depth
    !biom subset-table \
        --input-hdf5-fp ./02.build_package/01.input/deblur_no_blooms_125nt_1250.biom \
        --ids $sample_path \
        --axis sample \
        --output-fp ./03.packaged/$depth/deblur_125nt_no_blooms.biom
    
    # Creates a directory with the PICRUSt data
    if not os.path.exists('./03.packaged/%s/picrust' % depth):
        os.makedirs('./03.packaged/%s/picrust' % depth)

    # Filters the picrust tables and puts them in the directory
    for collapse in ['_L1', '_L2', '_L3']:
        picrust_fp_in = picrust_path % collapse
        picrust_fp_out = picrust_out % (depth, collapse)
        !biom subset-table \
            --input-hdf5-fp $picrust_fp_in \
            --output-fp $picrust_fp_out \
            --ids $sample_path \
            --axis 'sample'


In [None]:
from biom.table import vlen_list_of_str_formatter

from json import dumps, loads

In [None]:
def picrust_parser(*args):
    new_value = []
    for arg in args:
        new_value.append(loads(arg[0]))
    return new_value if new_value else None


def picrust_formatter(*args):
    """Transform, and format
    
    Taken directly from
    https://github.com/picrust/picrust/blob/master/picrust/util.py#L474
    per Daniel McDonald's directions to get the complex PICRUSt data
    to work...
    """
    return vlen_list_of_str_formatter(*list_of_list_of_str_formatter(*args))

def list_of_list_of_str_formatter(grp, header, md, compression):
    """Serialize [[str]] into a BIOM hdf5 compatible form
    Parameters
    ----------
    grp : h5py.Group
        This is ignored. Provided for passthrough
    header : str
        The key in each dict to pull out
    md : list of dict
        The axis metadata
    compression : bool
        Whether to enable dataset compression. This is ignored, provided for
        passthrough
    Returns
    -------
    grp : h5py.Group
        The h5py.Group
    header : str
        The key in each dict to pull out
    md : list of dict
        The modified metadata that can be formatted in hdf5
    compression : bool
        Whether to enable dataset compression.
    Notes
    -----
    This method is intended to be a "passthrough" to BIOM's
    vlen_list_of_str_formatter method. It is a transform method.
    """
    new_md = [{header: np.atleast_1d(np.asarray(dumps(m[header])))} for m in md]
    return (grp, header, new_md, compression)
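
The parser and formatter above are two halves of a JSON round trip: nested lists, such as PICRUSt's pathway annotations, are serialized to a single string on write and decoded on read. A minimal sketch of that idea using only the `json` module follows; the pathway values are made up for illustration, and no BIOM or HDF5 calls are involved.

```python
from json import dumps, loads

# Hypothetical KEGG_Pathways-style metadata: a list of lists of strings.
pathways = [['Metabolism', 'Energy Metabolism'],
            ['Cellular Processes', 'Cell Motility']]

# Writing: serialize the nested structure into one string, which fits
# in a flat, variable-length string field.
serialized = dumps(pathways)

# Reading: decode the string to recover the original nesting.
restored = loads(serialized)

print(serialized)
print(restored == pathways)
```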

In [None]:
format_fs = {'KEGG_Description': picrust_formatter,
             'COG_Description': picrust_formatter,
             'KEGG_Pathways': picrust_formatter,
             'COG_Category': picrust_formatter
            }

In [None]:
parse_fs = {'KEGG_Pathways': picrust_parser, 'KEGG_Description': picrust_parser}

In [None]:
fp_in = './02.raw_tables/picrust/otu_table_no_blooms_125nt_with_tax_min1250_gg99_normed_pred.biom'
with h5py.File(fp_in, 'r') as f_:
    t_in = biom.Table.from_hdf5(f_, parse_fs=parse_fs)

In [None]:
for depth in depths[::-1]:
    print(depth)
    id_ = sample_ids[depth]
    picrust_out = './03.packaged/%s/picrust/deblur_no_blooms_125nt_min1250_gg99_normed_pred.biom' % depth
    with h5py.File(fp_in, 'r') as f_:
        t_in = biom.Table.from_hdf5(f_, parse_fs=parse_fs)
    
    t_depth = t_in.filter(id_, axis='sample', inplace=False)
    
    with h5py.File(picrust_out, 'w') as fp_:
        t_depth.to_hdf5(fp_, 'filtered for %s sequences/sample' % depth, format_fs=format_fs)

Finally, we'll move the beta diversity distance matrices over from the beta folder. To do this, we'll extract each artifact and then move its distance matrix into the appropriate directory.

In [None]:
for depth in depths:
    temp_dir = './02.build_package/04.beta/%s/temp_dir/' % depth
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    if not os.path.exists('./03.packaged/%s/distance' % depth):
        os.makedirs('./03.packaged/%s/distance' % depth)
    for metric in ['bray_curtis', 'unweighted', 'weighted-normalized', 'weighted-unnormalized']:
#     for metric in ['bray_curtis']:
        old_filepath = './02.build_package/04.beta/%s/%s.qza' % (depth, metric)
        new_path = './03.packaged/%s/distance/%s.txt'
        !qiime tools extract \
            $old_filepath \
            --output-dir $temp_dir
        uuid_dir = os.path.join(temp_dir, os.listdir(temp_dir)[0])
        full_fp = os.path.join(uuid_dir, 'data/distance-matrix.tsv')
        shutil.move(full_fp, new_path % (depth, metric))
        !rm -r $uuid_dir
    os.removedirs(temp_dir)
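
`qiime tools extract` unpacks an artifact into a directory named after its UUID, which is why the cell above looks the directory up with `os.listdir`. The sketch below mimics that layout with placeholder files (the UUID and the matrix contents are made up) and locates the matrix with `glob`. This is an alternative to the notebook's approach, shown only to illustrate the directory structure.

```python
import glob
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
temp_dir = os.path.join(root, 'temp_dir')
out_dir = os.path.join(root, 'distance')

# Fake the extracted layout: <temp_dir>/<uuid>/data/distance-matrix.tsv
data_dir = os.path.join(temp_dir, '00000000-0000-0000-0000-000000000000', 'data')
os.makedirs(data_dir)
os.makedirs(out_dir)
with open(os.path.join(data_dir, 'distance-matrix.tsv'), 'w') as f_:
    f_.write('\tA\tB\nA\t0.0\t0.5\nB\t0.5\t0.0\n')

# The wildcard stands in for the unknown UUID directory name.
matches = glob.glob(os.path.join(temp_dir, '*', 'data', 'distance-matrix.tsv'))
assert len(matches) == 1, 'expected exactly one extracted artifact'

new_fp = os.path.join(out_dir, 'bray_curtis.txt')
shutil.move(matches[0], new_fp)

# Remove the now-unneeded extraction directory and its contents.
shutil.rmtree(temp_dir)

print(os.path.exists(new_fp), os.path.exists(temp_dir))
```

Checking that exactly one match exists guards against a leftover UUID directory from an earlier, interrupted run.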