This notebook does initial processing on the metadata and BIOM tables to limit compute.

In [2]:
import os

import biom
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn

%matplotlib inline

We'll start by loading the metadata and sOTU table.

In [3]:
meta = pd.read_csv('./01.metadata/ag_full_map.txt', sep='\t', dtype=str)
meta.set_index('#SampleID', inplace=True)

In [11]:
sotu = biom.load_table('./02.raw_tables/otu_table_no_blooms_125nt_with_tax_min1250.biom')

We'll add the sequencing depth to the metadata, and filter the metadata to exclude any sample that is not in the BIOM table.

In [12]:
meta['seq_depth'] = pd.Series(sotu.sum(axis='sample'), index=sotu.ids(axis='sample'))
meta = meta.loc[sotu.ids(axis='sample')]
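The depth column relies on pandas aligning on the index when a Series is assigned to a DataFrame column. The sketch below illustrates that behaviour on made-up data. The sample IDs and depths are hypothetical stand-ins for `sotu.ids(axis='sample')` and `sotu.sum(axis='sample')`, not values from the real table.

```python
import pandas as pd

# Hypothetical toy metadata: three samples, one of which (s2) is not in the table.
toy_meta = pd.DataFrame(
    {'host_subject_id': ['A', 'A', 'B']},
    index=pd.Index(['s1', 's2', 's3'], name='#SampleID'),
)

# Hypothetical stand-ins for the table's sample IDs and per-sample totals.
table_ids = ['s3', 's1']
table_depths = [1500.0, 2000.0]

# Column assignment aligns on the index, so s2 gets NaN rather than a depth.
toy_meta['seq_depth'] = pd.Series(table_depths, index=table_ids)

# .loc with a list of labels keeps only those rows, in the order given.
kept = toy_meta.loc[table_ids]
print(kept)
```

Note that `.loc` raises a `KeyError` if any requested label is missing from the index, so this step assumes every sample in the table also appears in the metadata.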

In [13]:
len(meta['host_subject_id'].unique())

10499

Then, we'll filter the metadata to include only fecal samples, since this is what the analysis will use.

In [16]:
fecal_meta = meta.loc[meta['body_habitat'] == 'UBERON:feces']

We'll filter the mapping file to include only one sample per participant. One sample is chosen at random for each participant, and samples associated with a VioScreen survey are prioritized.

In [17]:
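single_ids = []
for id_, t in fecal_meta[['host_subject_id', 'vioscreen_questionnaire']].groupby('host_subject_id'):
    # If every sample has the same questionnaire status, pick any sample at random
    if len(t['vioscreen_questionnaire'].unique()) == 1:
        single_ids.append(np.random.choice(t.index, 1))
    # Otherwise, pick at random among the samples with a VioScreen questionnaire
    else:
        single_ids.append(
            np.random.choice(t.loc[t['vioscreen_questionnaire'] == 'Egg White Standard Questionnaire (V3)'].index)
        )
# Flatten the mix of scalars and length-1 arrays into a single array of sample IDs
single_ids = np.hstack(single_ids)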
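The selection above draws from NumPy's global random state, so the chosen samples change between runs. The following is a minimal sketch of a reproducible variant, not the notebook's own code. The function name `pick_one_per_subject`, the `seed` parameter, and the toy data are all hypothetical. It also falls back to any of the participant's samples when none carry the prioritized questionnaire value, which is an added assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

PRIORITY = 'Egg White Standard Questionnaire (V3)'


def pick_one_per_subject(df, priority_value, seed=0):
    """Pick one sample ID per host_subject_id, preferring priority_value."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    chosen = []
    for _, group in df.groupby('host_subject_id'):
        mask = group['vioscreen_questionnaire'] == priority_value
        preferred = group.loc[mask].index
        # Fall back to all of the subject's samples when none are prioritized.
        pool = preferred if len(preferred) > 0 else group.index
        chosen.append(rng.choice(pool.to_numpy()))
    return chosen


# Hypothetical toy metadata, indexed by sample ID.
toy = pd.DataFrame(
    {
        'host_subject_id': ['A', 'A', 'B', 'B', 'C'],
        'vioscreen_questionnaire': [PRIORITY, None, None, None, 'other'],
    },
    index=['a1', 'a2', 'b1', 'b2', 'c1'],
)

picked = pick_one_per_subject(toy, PRIORITY, seed=42)
print(picked)
```

Calling the function again with the same seed returns the same sample IDs, which makes the downstream tables repeatable.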

We'll then filter the data to a single fecal sample per individual.

In [18]:
single_meta = fecal_meta.loc[single_ids]
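# Table.filter modifies the table in place by default; inplace=False leaves sotu unchanged
single_sotu = sotu.filter(single_ids, axis='sample', inplace=False)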
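Before saving, it can be worth confirming that the filtered metadata has one row per participant and covers the same samples as the filtered table. The helper below is a hypothetical sketch of such a check. `check_single_sample` is not part of pandas or biom, and the toy data is made up.

```python
import pandas as pd


def check_single_sample(meta_df, table_ids, subject_col='host_subject_id'):
    """Return True if each subject appears once and the IDs match the table."""
    one_per_subject = meta_df[subject_col].is_unique
    same_ids = set(meta_df.index) == set(table_ids)
    return bool(one_per_subject and same_ids)


# Hypothetical toy data: two subjects, one sample each.
toy_single = pd.DataFrame({'host_subject_id': ['A', 'B']}, index=['a1', 'b2'])

ok = check_single_sample(toy_single, ['a1', 'b2'])
mismatch = check_single_sample(toy_single, ['a1'])
print(ok, mismatch)
```

In this notebook the call would look like `check_single_sample(single_meta, single_sotu.ids(axis='sample'))`.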

Finally, we'll save the data.

In [20]:
# Create the output directory if it does not already exist
os.makedirs('./02.build_package', exist_ok=True)

single_meta.to_csv('./02.build_package/fecal_map_1250.txt', sep='\t', index_label="#SampleID")

with biom.util.biom_open('./deblur_no_blooms_125nt_1250.biom', 'w') as f_:
    single_sotu.to_hdf5(f_, 'single fecal sample per participant')