## Ryan Inghilterra - Bloom Removal

In [1]:
from biom import load_table
import skbio
import qiime2

Read in the saved full_otus Deblur BIOM table.

`btable` contains all of the sOTUs for all of our AGP samples, from the Deblur context 'Deblur-Illumina-16S-V4-100nt-fbc5b2'.

In [2]:
btable = load_table('../../3.4.ryan_i_greengenes/full_otus/feature-table.biom')

In [3]:
btable

34114 x 21238 <class 'biom.table.Table'> with 9262753 nonzero entries (1% dense)

In [4]:
seqs_file = 'newbloom.all.fna'

In [5]:
seqs = skbio.read(seqs_file, format='fasta')

In [6]:
seqs = skbio.read(seqs_file, format='fasta')
filter_seqs = {str(s) for s in seqs}

In [7]:
bloom_list = list(filter_seqs)

In [8]:
print(len(bloom_list))

20


There are 20 bloom sequences to filter out. Each must be trimmed to 100 nt so it can be compared against the 100 nt Deblur sOTU sequences.

In [9]:
bloom_list_trim = [bloom[:100] for bloom in bloom_list]

In [10]:
len(bloom_list_trim[0])

100

In [16]:
btable.head()

5 x 5 <class 'biom.table.Table'> with 2 nonzero entries (8% dense)

In [11]:
otu_seqs = btable.ids(axis='observation')

In [15]:
otu_seqs

array(['4386973', '243204', '539581', ..., '233009', '1015502', '253863'], dtype=object)

This is a manual way to check which of the 426648 sOTUs match a bloom sequence.

In [14]:
bloom_set = set(bloom_list_trim)  # build the lookup set once, not on every iteration
for i, otu in enumerate(otu_seqs):
    if otu in bloom_set:
        print(i, otu)
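
The manual check above amounts to a set-membership test between the trimmed bloom sequences and the feature IDs. The following is a minimal sketch on made-up toy data; `find_bloom_matches`, `toy_blooms`, and `toy_feature_ids` are hypothetical names, not part of the notebook. It assumes the observation IDs are the sequences themselves: a table indexed by reference IDs (such as the numeric IDs printed above) would produce no matches.

```python
def find_bloom_matches(feature_ids, bloom_seqs, trim_length=100):
    """Return (index, feature_id) pairs whose ID equals a trimmed bloom sequence."""
    # Trim every bloom sequence and collect them in a set for O(1) lookups.
    trimmed = {seq[:trim_length] for seq in bloom_seqs}
    return [(i, fid) for i, fid in enumerate(feature_ids) if fid in trimmed]


# Toy data (hypothetical): 8 nt "features" instead of 100 nt, for readability.
toy_blooms = ["ACGTACGTAAAA", "TTTTGGGGCCCC"]
toy_feature_ids = ["GGGGCCCC", "ACGTACGT", "TTTTGGGG", "4386973"]

matches = find_bloom_matches(toy_feature_ids, toy_blooms, trim_length=8)
for index, feature_id in matches:
    print(index, feature_id)
```

With real data, `feature_ids` would be `btable.ids(axis='observation')` and `bloom_seqs` would be the strings in `bloom_list`.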

Below we actually remove the blooms, borrowing code from https://github.com/knightlab-analyses/bloom-analyses/blob/master/ipynb/bloom_example.ipynb

In [13]:
import skbio
import biom
import argparse
import sys

__version__='1.0'


def trim_seqs(seqs, seqlength=100):
    """
    Trims the sequences to a given length

    Parameters
    ----------
    seqs: generator of skbio.Sequence objects
    seqlength: int
        length to trim each sequence to (default 100)

    Returns
    -------
    generator of skbio.Sequence objects
        trimmed sequences
    """

    for seq in seqs:

        if len(seq) < seqlength:
            raise ValueError('sequence length is shorter than %d' % seqlength)

        yield seq[:seqlength]


def remove_seqs(table, seqs):
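    """
    Removes the observations whose IDs match the given sequences

    Parameters
    ----------
    table : biom.Table
       Input biom table
    seqs : generator, skbio.Sequence
       Iterator of sequence objects to be removed from the biom table.

    Returns
    -------
    biom.Table
       Copy of the input table without the matching observations
    """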
    filter_seqs = {str(s) for s in seqs}
    _filter = lambda v, i, m: i not in filter_seqs
    return table.filter(_filter, axis='observation', inplace=False)
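
To show how the two helpers fit together without loading a real table, here is a minimal pure-Python sketch. It is a stand-in, not the biom API: the table is modeled as a plain dict mapping feature ID to per-sample counts, and `trim_strings`, `remove_from_counts`, `toy_counts`, and `toy_blooms` are hypothetical names introduced only for illustration.

```python
def trim_strings(seqs, seqlength=100):
    """Yield each sequence cut to seqlength; reject sequences that are too short."""
    for seq in seqs:
        if len(seq) < seqlength:
            raise ValueError('sequence length is shorter than %d' % seqlength)
        yield seq[:seqlength]


def remove_from_counts(counts, seqs):
    """Return a new dict without the features whose ID is in seqs."""
    filter_seqs = set(seqs)
    return {fid: row for fid, row in counts.items() if fid not in filter_seqs}


# Toy stand-in for a feature table (hypothetical data): feature ID -> counts per sample.
toy_counts = {"ACGTAC": [3, 0], "TTTTGG": [0, 5], "GGGCCC": [1, 1]}
toy_blooms = ["ACGTACGTAA", "CCCCCCCCCC"]

# Trim the blooms to the shortest feature ID so exact string matching can work.
length = min(map(len, toy_counts))
filtered = remove_from_counts(toy_counts, trim_strings(toy_blooms, length))
print(length, sorted(filtered))
```

Matching is exact, so a bloom is removed only if its trimmed form is identical to a feature ID; the second toy bloom matches nothing and leaves the table unchanged.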

In [14]:
btable = load_table('../full_otus/feature-table.biom')
seqs_file = 'newbloom.all.fna'

In [15]:
seqs = skbio.read(seqs_file, format='fasta')

In [16]:
btable

426648 x 19231 <class 'biom.table.Table'> with 3606467 nonzero entries (0% dense)

In [17]:
length = min(map(len, btable.ids(axis='observation')))
seqs = trim_seqs(seqs, seqlength=length)

In [18]:
outtable = remove_seqs(btable, seqs)

In [19]:
outtable

426633 x 19231 <class 'biom.table.Table'> with 3540437 nonzero entries (0% dense)

In [20]:
table_ar = qiime2.Artifact.import_data('FeatureTable[Frequency]', outtable)

In [21]:
table_ar.export_data('no_bloom_full_otus')

In [22]:
nobtable = load_table('no_bloom_full_otus/feature-table.biom')

In [23]:
nobtable

426633 x 19231 <class 'biom.table.Table'> with 3540437 nonzero entries (0% dense)

Removing blooms dropped only 15 sOTUs (426648 to 426633), but those sOTUs accounted for 66030 nonzero table entries (3606467 to 3540437).
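
The size of the change can be checked directly from the table summaries printed above. This is a small sketch using only those reported numbers; the variable names are hypothetical. Note that "nonzero entries" counts sample-by-feature cells, not reads, so the fraction below is not the fraction of reads removed.

```python
# Numbers copied from the table summaries before and after bloom removal.
before_features, before_nonzero = 426648, 3606467
after_features, after_nonzero = 426633, 3540437

removed_features = before_features - after_features
removed_nonzero = before_nonzero - after_nonzero
fraction_nonzero = removed_nonzero / before_nonzero

print("sOTUs removed:", removed_features)
print("nonzero entries removed:", removed_nonzero)
print("share of nonzero entries removed: {:.2%}".format(fraction_nonzero))
```

To measure the share of reads removed instead, one would compare the total counts of `btable` and `outtable`, which the notebook does not compute.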