**Author**: JW Debelius (justine.debelius@ki.se)<br>
**Date**: September 2021<br>
**Environment**: `qiime2-2021.8-dev`<br>
**Python version**: 3.8<br>
**Extra packages**: None<br>
**QIIME version**: 2021.8-dev<br>
**Extra Plugins**: q2-sidle (memory refactor); RESCRIPt (v. )

# Mock Community Data Preparation

## Background

This notebook processes four mock community samples sequenced with the Ion Torrent metagenomic kit, originally published by [Barb et al, 2015](https://pubmed.ncbi.nlm.nih.gov/26829716/). In the original paper, the authors profiled four mock communities using the proprietary Ion Torrent kit. This kit uses 6 primer pairs to target 7 regions along the 16S gene, with both forward and reverse reads possible:
* V2
* V3
* V4
* V67
* V8
* V9

In the original paper, the authors compared the performance of each region against the baseline. They clustered sequences into de novo OTUs with UPARSE; taxonomic assignment was made in QIIME 1, and the results were compared using abundance profiling and deviation from the published Shannon diversity.

## Data download and availability

Sequences were deposited in SRA under accession SUB1054354. We used the [SRA CLI tools]() to download the sequences and the provided sample sheet. Sequence fastq files and the description were saved in the `mock` folder in the `data/inputs` directory.

When they were deposited, the sequences were demultiplexed by sample, but not by region. The Ion Torrent kit produced reads for 12 regions (6 forward and 6 reverse), meaning that to be able to use the sequences here, we need to split them into regions.

## Preprocessing Approach

For the sake of not hating everyone and everything in existence, we will take a somewhat less optimal approach and try to do a regional demux on already denoised sequences, because otherwise both I and my computer will cry.
So, sequences are prepared through the following steps:

1. Sequences from each sample are filtered into batches based on read length. For this, we'll use 150, 200, 250, and 300 nt reads.
2. Import the reads for each length using the manifest format.
3. Denoise the sequences at their fixed read length using `dada2 denoise-pyro`, which better handles the error profile. Reads are truncated to the appropriate read length during denoising.
4. Use RESCRIPt to orient the sequences so they have a consistent orientation.
5. Align the reads against the reference sequences to separate them into regions.
6. Identify the regions through some kind of witchcraft.
7. Filter the sequences into samples and regions.
8. Account for all the sequences that were lost during filtering and aligning.
<!-- 9. Consider life choices and decide why there is such an obsession with mock communitites -->
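
The notebook does not show how step 1 was carried out, so the following is only a minimal sketch of one way to batch reads by length. It assumes each read goes into the largest bin it is long enough to fill, so that it can later be truncated to that length. The helpers `read_fastq` and `bin_by_length` are hypothetical names, not part of any library.

```python
import io
from collections import defaultdict

# Read-length bins named in step 1.
LENGTHS = (150, 200, 250, 300)


def read_fastq(handle):
    """Yields (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip('\n')
        if not header:
            return
        seq = handle.readline().rstrip('\n')
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip('\n')
        yield header, seq, qual


def bin_by_length(records, lengths=LENGTHS):
    """Assigns each read to the largest bin it can fill (an assumption)."""
    bins = defaultdict(list)
    for header, seq, qual in records:
        eligible = [length for length in lengths if len(seq) >= length]
        if eligible:  # reads shorter than the smallest bin are dropped
            bins[max(eligible)].append((header, seq, qual))
    return dict(bins)


# Toy reads of 120, 160 and 310 nt.
toy = ''.join(f'@read{n}\n{"A" * n}\n+\n{"I" * n}\n' for n in (120, 160, 310))
batches = bin_by_length(read_fastq(io.StringIO(toy)))
batch_sizes = {length: len(reads) for length, reads in batches.items()}
```

In this toy example the 120 nt read is discarded, the 160 nt read lands in the 150 bin, and the 310 nt read lands in the 300 bin.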

## Set up

In [1]:
import os

import biom
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import skbio

%matplotlib inline

from qiime2 import Artifact, Metadata, Visualization

In [2]:
input_dir = 'data/inputs/mock/' # modify if you've placed the files in another location
output_dir = 'data/output/mock/' # Change this if you want a different output
ref_dir = 'data/reference/' # Location of reference files

In [3]:
steps = {
    'import_seqs': {
        'run': True,
        'overwrite': True,
        'input_dir': input_dir,
        'output_dir': os.path.join(output_dir, '1.import'),
    },
    'denoise_seqs': {
        'run': True,
        'overwrite': True,
        'read_length': 200,
        'input_dir':  os.path.join(output_dir, '1.import'),
        'output_dir':  os.path.join(output_dir, '2.denoise'),
    },
    'cluster_orientation': {
        'run': True,
        'overwrite': True,
        'input_dir': os.path.join(output_dir, '2.denoise'),
        'output_dir': os.path.join(output_dir, '3.split_orientation'),
        'references': {'fwd': 'data/reference/gg_13_8_88/88_otus.qza',
                       'rev': 'data/reference/gg_13_8_88/88_otus_rc.qza'},
    },
    'generate_alignment_reference': {
        'run': True,
        'input_refs': {
            'fwd': 'data/reference/gg_13_8_88/gg_88_otus_aligned.qza',
            'rev': 'data/reference/gg_13_8_88/gg_88_otus_aligned_rc.qza',
            },
        'taxonomy_fp': 'data/reference/gg_13_8_88/88_otu_taxonomy.qza',
        'output_refs': {
            'fwd': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza',
            'rev': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_rev.qza',
        },
        'keep_group': 'f__Enterobacteriaceae',
    },
    'align_to_reference': {
        'run': True,
        'overwrite': True,
        'seq_input_dir':  os.path.join(output_dir, '3.split_orientation'),
        'table_input_dir': os.path.join(output_dir, '2.denoise'),
        'output_dir': os.path.join(output_dir, '4.aligned_to_ref'),
        'reference_groups': 'f__Enterobacteriaceae',
        'references': {'fwd':'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza',
                       'rev': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza',
                       },
    },
    'split_to_region': {
        'run': True,
        'overwrite': True,
        'input_map_dir': os.path.join(output_dir, '4.aligned_to_ref'),
        'input_data_dir': os.path.join(output_dir, '2.denoise'),
        'output_dir': os.path.join(output_dir, '5.regional_demux'),
    },
    'prepare_database': {
        'run': True,
        'overwrite': True,
        'input_seqs': os.path.join(ref_dir, 'silva/silva-128-99-seqs.qza'),
        'input_taxa': os.path.join(ref_dir, 'silva/silva-128-99-taxonomy.qza'),
        'input_alignment': os.path.join(ref_dir, 'silva/silva-128-99-aligned-seqs.qza'),
        'output_dir': os.path.join(output_dir, '6.database_alignment/'),
        'tmp_dir':   os.path.join(output_dir, '6.database_alignment/tmp'),
    },
    'subsample_db': {
        'run': True,
        'overwrite': True,
        'fraction_subsample': 0.1,
        'seed': 1776,
        'input_alignment': os.path.join(output_dir, '6.database_alignment/silva-128-99-aligned-5-degen-bact-archea-only.qza'),
        'output_dir': os.path.join(output_dir, '6.database_alignment/'),
    },
    'align_regions_for_sub_db': {
        'run': True,
        'overwrite': True,
        'input_db_dir': os.path.join(output_dir, '6.database_alignment/'),
        'input_asv_dir': os.path.join(output_dir, '5.regional_demux'),
        'output_dir': os.path.join(output_dir, '7.regional_alignment'),
        'abundance_thresh': 1000,
    },
}

In [4]:
samples = [fp.split('.')[0] for fp in os.listdir(input_dir)
           if ('_1' in fp) and ('182' in fp)]

In [5]:
samples

['SRR2182222_1', 'SRR2182220_1', 'SRR2182221_1']

## Preprocessing

### Get preprocessing references

...

### Import the data into QIIME 2 using a manifest format

Having split the data from each sample by read length, we'll build a manifest and import the reads into QIIME 2.

In [10]:
def build_manifest(samples):
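    """
    Builds a single-end sample manifest mapping each sample ID to its
    fastq file. Reads the input directory from the notebook-level
    `manifest_input_dir`, which must be set before this is called.
    """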
    manifest = pd.DataFrame.from_dict(orient='index', data={
        sample: {
            'absolute-filepath': os.path.abspath(
                f'{manifest_input_dir}/{sample}.fastq.gz')
        }
        for sample in samples
    })
    manifest.index.set_names('sample-id', inplace=True)
    return Metadata(manifest)
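
For reference, the V2 single-end manifest is a tab-separated file with a `sample-id` column and an `absolute-filepath` column. The sketch below writes that layout with only the standard library and takes the input directory as an explicit argument instead of a notebook-level variable. `write_manifest` is a hypothetical helper, not a QIIME 2 function.

```python
import csv
import io
import os


def write_manifest(samples, input_dir, handle):
    """Writes a single-end manifest (sample-id, absolute-filepath) as TSV."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['sample-id', 'absolute-filepath'])
    for sample in samples:
        fastq_fp = os.path.abspath(
            os.path.join(input_dir, f'{sample}.fastq.gz'))
        writer.writerow([sample, fastq_fp])


# Written to an in-memory buffer here; a real run would open a .tsv file.
buffer = io.StringIO()
write_manifest(['SRR2182220_1', 'SRR2182221_1'], 'data/inputs/mock', buffer)
manifest_lines = buffer.getvalue().splitlines()
```

Passing the directory explicitly keeps the helper usable outside the cell that defines `manifest_input_dir`.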

In [11]:
if steps['import_seqs']['run']:
    manifest_input_dir = steps['import_seqs']['input_dir']
    manifest_output_dir = steps['import_seqs']['output_dir']
    manifest_overwrite = steps['import_seqs']['overwrite']

    os.makedirs(manifest_output_dir, exist_ok=manifest_overwrite)
    manifest_fp = f'{manifest_output_dir}/manifest.tsv'
    seqs_art_fp = f'{manifest_output_dir}/demux_reads.qza'
    seqs_vis_fp = f'{manifest_output_dir}/demux_reads.qzv'

    manifest = build_manifest(samples)
    manifest.save(manifest_fp)

    !qiime tools import \
      --type 'SampleData[SequencesWithQuality]' \
      --input-path $manifest_fp \
      --output-path $seqs_art_fp \
      --input-format SingleEndFastqManifestPhred33V2

    !qiime demux summarize \
     --i-data $seqs_art_fp \
     --o-visualization $seqs_vis_fp

Imported data/output/mock/1.import/manifest.tsv as SingleEndFastqManifestPhred33V2 to data/output/mock/1.import/demux_reads.qza
Saved Visualization to: data/output/mock/1.import/demux_reads.qzv


### Denoise sequences

The recommendation for Ion Torrent sequencing is to denoise using `dada2 denoise-pyro` **[citation needed]**, so we'll follow that.

In [12]:
if steps['denoise_seqs']['run']:
    denoised_input = steps['denoise_seqs']['input_dir']
    denoised_output = steps['denoise_seqs']['output_dir']
    denoised_overwrite = steps['denoise_seqs']['overwrite']
    length = steps['denoise_seqs']['read_length']
    os.makedirs(denoised_output, exist_ok=denoised_overwrite)

    seqs_art_fp = f'{denoised_input}/demux_reads.qza'
    table_art_fp = f'{denoised_output}/table_.qza'
    rep_seq_art_fp = f'{denoised_output}/rep_seq.qza'
    stats_art_fp = f'{denoised_output}/denosing_stats.qza'
    table_viz_fp = f'{denoised_output}/table.qzv'
    stats_viz_fp = f'{denoised_output}/denosing_stats.qzv'

    !qiime dada2 denoise-pyro \
     --i-demultiplexed-seqs $seqs_art_fp \
     --p-trunc-len $length \
     --p-hashed-feature-ids \
     --p-n-threads 4 \
     --o-table $table_art_fp \
     --o-representative-sequences $rep_seq_art_fp \
     --o-denoising-stats $stats_art_fp

    !qiime metadata tabulate \
     --m-input-file $stats_art_fp \
     --o-visualization $stats_viz_fp

    !qiime feature-table summarize \
     --i-table $table_art_fp \
     --o-visualization $table_viz_fp

Saved FeatureTable[Frequency] to: data/output/mock/2.denoise/table_.qza
Saved FeatureData[Sequence] to: data/output/mock/2.denoise/rep_seq.qza
Saved SampleData[DADA2Stats] to: data/output/mock/2.denoise/denosing_stats.qza
Saved Visualization to: data/output/mock/2.denoise/denosing_stats.qzv
Saved Visualization to: data/output/mock/2.denoise/table.qzv


And now we have a set of full denoised tables with a fixed read length and a mix of read orientations and regions. These are now ready for regional demultiplexing.

## Regional Demultiplexing


### Split the data into forward and reverse reads

We'll split the data by clustering the sequences against a forward and a reverse-complemented reference, and then filter to retain the sequences that clustered.
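
The `rev` reference is presumably the reverse complement of the `fwd` one (the `_rc` suffix in the file name suggests this), so a read should only cluster against the reference that shares its orientation. As a reminder of what that transformation does, here is a small standard-library sketch. `reverse_complement` is an illustrative helper, not the tool used to build the reference files.

```python
# Complement table for the IUPAC nucleotide alphabet, including degenerates.
COMPLEMENT = str.maketrans(
    'ACGTRYSWKMBDHVN',
    'TGCAYRSWMKVHDBN',
)


def reverse_complement(seq):
    """Returns the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


read = 'ACGGTTAN'
flipped = reverse_complement(read)
```

Applying the function twice returns the original read, which is a quick sanity check on the complement table.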

In [34]:
os.listdir(steps['cluster_orientation']['input_dir'])

['table.qzv',
 'denosing_stats.qzv',
 'table_.qza',
 'rep_seq.qza',
 'denosing_stats.qza']

In [24]:
if steps['cluster_orientation']['run']:
    cluster_input = steps['cluster_orientation']['input_dir']
    cluster_output = steps['cluster_orientation']['output_dir']
    cluster_overwrite = steps['cluster_orientation']['overwrite']
    cluster_references = steps['cluster_orientation']['references']
    cluster_rename = {'fwd': 'rev', 'rev': 'fwd'}
    os.makedirs(cluster_output, exist_ok=cluster_overwrite)
    
    for dir_, ref_seq_fp in cluster_references.items():    
        seqs_in_fp = f'{cluster_input}/rep_seq.qza'
        table_in_fp = f'{cluster_input}/table_.qza'
        cluster_otu_dir = f'{cluster_output}/clustered-seqs-{dir_}/'        
        seq_discard_fp = f'{cluster_otu_dir}/unmatched_sequences.qza'
        keep_seqs_fp = f'{cluster_output}/rep-seqs-{dir_}.qza'
        keep_table_fp = f'{cluster_output}/table-{dir_}.qza'
        
        # Clusters the sequences closed reference at a low percent identity
        !qiime vsearch cluster-features-closed-reference \
         --i-sequences $seqs_in_fp \
         --i-table $table_in_fp \
         --i-reference-sequences $ref_seq_fp \
         --p-perc-identity 0.85 \
         --output-dir $cluster_otu_dir

        # Drops the sequences that failed to cluster against the reference
        !qiime feature-table filter-seqs \
         --i-data $seqs_in_fp \
         --m-metadata-file $seq_discard_fp \
         --p-exclude-ids \
         --o-filtered-data $keep_seqs_fp
        
        !qiime feature-table filter-features \
         --i-table $table_in_fp \
         --m-metadata-file  $keep_seqs_fp \
         --o-filtered-table $keep_table_fp
        
        !rm -r $cluster_otu_dir

rm: data/output/mock/3.split_orientation/clustered-seqs-fwd/: No such file or directory
Saved FeatureTable[Frequency] to: data/output/mock/3.split_orientation/clustered-seqs-fwd/clustered_table.qza
Saved FeatureData[Sequence] to: data/output/mock/3.split_orientation/clustered-seqs-fwd/clustered_sequences.qza
Saved FeatureData[Sequence] to: data/output/mock/3.split_orientation/clustered-seqs-fwd/unmatched_sequences.qza
Saved FeatureData[Sequence] to: data/output/mock/3.split_orientation/rep-seqs-fwd.qza
Saved FeatureTable[Frequency] to: data/output/mock/3.split_orientation/table-fwd.qza
rm: data/output/mock/3.split_orientation/clustered-seqs-rev/: No such file or directory
Saved FeatureTable[Frequency] to: data/output/mock/3.split_orientation/clustered-seqs-rev/clustered_table.qza
Saved FeatureData[Sequence] to: data/output/mock/3.split_orientation/clustered-seqs-rev/clustered_sequences.qza
Saved FeatureData[Sequence] t

### Generate sub alignment references

In [54]:
if steps['generate_alignment_reference']['run']: 
    input_refs = steps['generate_alignment_reference']['input_refs']
    output_refs = steps['generate_alignment_reference']['output_refs']
    taxonomy_fp = steps['generate_alignment_reference']['taxonomy_fp']
    keep_group = steps['generate_alignment_reference']['keep_group']

    for dir_, input_ in input_refs.items():
        output = output_refs[dir_]
        if not os.path.exists(output):
            !qiime taxa filter-seqs \
             --i-sequences $input_ \
             --i-taxonomy $taxonomy_fp \
             --p-include $keep_group \
             --o-filtered-sequences $output

### Align the oriented data to the reference and find the starting positions

In [55]:
!qiime dev refresh-cache

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.


In [62]:
if steps['align_to_reference']['run']: 
    align_repseq_dir = steps['align_to_reference']['seq_input_dir']
    align_table_dir = steps['align_to_reference']['table_input_dir']
    align_output_dir = steps['align_to_reference']['output_dir']
    align_overwrite = steps['align_to_reference']['overwrite']
    align_references =  steps['align_to_reference']['references']

    os.makedirs(align_output_dir, exist_ok=align_overwrite)

    for dir_, ref_alignment_fp in align_references.items():
        rep_seq_fp = f'{align_repseq_dir}/rep-seqs-{dir_}.qza'
        table_fp = f'{align_table_dir}/table_.qza'

        alignment_fp = \
            f'{align_output_dir}/rep_set_aligned_{dir_}.qza'
        output_pos_art = \
            f'{align_output_dir}/starts-{dir_}.qza'
        output_pos_viz = \
            f'{align_output_dir}/starts-{dir_}.qzv'

        !qiime sidle map-alignment-positions \
         --i-alignment $ref_alignment_fp \
         --i-sequences $rep_seq_fp \
         --i-table $table_fp \
         --p-direction $dir_ \
         --p-no-add-fragments \
         --p-colormap viridis \
         --o-expanded-alignment $alignment_fp \
         --o-position-summary $output_pos_art \
         --o-position-map $output_pos_viz

Saved FeatureData[AlignedSequence] to: data/output/mock/4.aligned_to_ref/rep_set_aligned_fwd.qza
Saved FeatureData[AlignmentPosSummary] to: data/output/mock/4.aligned_to_ref/starts-fwd.qza
Saved Visualization to: data/output/mock/4.aligned_to_ref/starts-fwd.qzv
Saved FeatureData[AlignedSequence] to: data/output/mock/4.aligned_to_ref/rep_set_aligned_rev.qza
Saved FeatureData[AlignmentPosSummary] to: data/output/mock/4.aligned_to_ref/starts-rev.qza
Saved Visualization to: data/output/mock/4.aligned_to_ref/starts-rev.qzv


In [205]:
# from qiime2 import Visualization
# Visualization.load('data/output/mock/4.aligned_to_ref/starts-fwd.qzv')

In [206]:
# Visualization.load('data/output/mock/4.aligned_to_ref/starts-rev.qzv')

### Extract the starting positions from the alignment
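
Conceptually, the starting position of a read is the first alignment column in which the aligned read has a base rather than a gap. The sketch below illustrates that idea on toy strings. It is an assumption about what "starting position" means here and is not q2-sidle's implementation; `first_position` is a hypothetical helper.

```python
GAP_CHARS = '-.'


def first_position(aligned_seq):
    """Returns the 0-based index of the first non-gap column, or None."""
    for idx, char in enumerate(aligned_seq):
        if char not in GAP_CHARS:
            return idx
    return None


# Toy aligned reads: gaps pad each read out to the reference coordinates.
aligned = {
    'asv1': '----ACGT--',
    'asv2': '--------GG',
    'asv3': '----------',
}
starts = {asv: first_position(seq) for asv, seq in aligned.items()}
```

Reads that share a primer should pile up at the same starting column, which is what makes the position usable as a region label.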

### Regional Demultiplexing 

Based on the visualization, I can infer a set of starting positions. For the forward reads, I looked at features with at least 2 sequences and a maximum abundance of at least 1000 sequences. This gives me starts at 69, 308, 516, 926, 1054, and 1301 for the forward reads. The reverse positions are a little harder because of the weird split in that block around 400. But we also read the starting position backward, so maybe it's not so weird? The reverse reads end up at 340, 541, 815, 1153, 1313, and 1466.

In [7]:
Artifact.load('data/output/mock/4.aligned_to_ref/starts-fwd.qza').view(Metadata)

Metadata
--------
137 IDs x 3 columns
starting-position: ColumnProperties(type='categorical')
sequence-counts:   ColumnProperties(type='numeric')
direction:         ColumnProperties(type='categorical')

Call to_dataframe() for a tabular representation.
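
One way to turn a position summary into candidate starts is to total the counts at each starting position and keep the positions that pass some thresholds. The sketch below does this on made-up rows shaped like the metadata columns above. The thresholds, the use of summed counts, and the helper name `candidate_starts` are all assumptions; in this notebook the positions were chosen by eye from the visualization.

```python
from collections import defaultdict


def candidate_starts(rows, min_features=2, min_count=1000):
    """Keeps positions seen in enough features with enough total reads."""
    n_features = defaultdict(int)
    total = defaultdict(float)
    for row in rows:
        pos = int(row['starting-position'])
        n_features[pos] += 1
        total[pos] += row['sequence-counts']
    return sorted(
        pos for pos in total
        if n_features[pos] >= min_features and total[pos] >= min_count
    )


# Made-up rows with the same columns as the position summary.
rows = [
    {'starting-position': '69', 'sequence-counts': 900.0, 'direction': 'fwd'},
    {'starting-position': '69', 'sequence-counts': 400.0, 'direction': 'fwd'},
    {'starting-position': '75', 'sequence-counts': 5000.0, 'direction': 'fwd'},
    {'starting-position': '308', 'sequence-counts': 20.0, 'direction': 'fwd'},
    {'starting-position': '308', 'sequence-counts': 30.0, 'direction': 'fwd'},
]
kept = candidate_starts(rows)
```

Only position 69 survives here: position 75 has a single feature, and position 308 has too few reads.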

In [6]:
first_positions = {'fwd': [ 69, 308, 516, 926, 1054, 1301],
                   'rev': [340, 541, 815, 1153, 1313, 1466]
                   }

I'll use those positions to split the data.

In [27]:
if steps['split_to_region']['run']: 
    region_demux_overwrite = steps['split_to_region']['overwrite']
    region_demux_meta_dir = steps['split_to_region']['input_map_dir']
    region_demux_data_dir = steps['split_to_region']['input_data_dir']
    region_demux_output_dir = steps['split_to_region']['output_dir']

    os.makedirs(region_demux_output_dir, exist_ok=region_demux_overwrite)

    for dir_, positions in first_positions.items():
        for pos in positions:
            input_table_fp = f'{region_demux_data_dir}/table_.qza'
            input_rep_seq_fp = \
                f'{region_demux_data_dir}/rep_seq.qza'

            meta_fp = f'{region_demux_meta_dir}/starts-{dir_}.qza'
            
            table_fp = \
                f'{region_demux_output_dir}/table-{dir_}-{pos}.qza'
            rep_seq_fp = \
                f'{region_demux_output_dir}/rep-seq-{dir_}-{pos}.qza'
            table_summary_fp = \
                 f'{region_demux_output_dir}/table-{dir_}-{pos}.qzv'
            
            where = f'[starting-position]="{pos}"'

            !qiime feature-table filter-features \
             --i-table $input_table_fp \
             --m-metadata-file $meta_fp \
             --p-where $where \
             --o-filtered-table $table_fp

            !qiime feature-table filter-seqs \
             --i-data $input_rep_seq_fp \
             --i-table $table_fp \
             --o-filtered-data $rep_seq_fp

            !qiime feature-table summarize \
             --i-table $table_fp \
             --o-visualization $table_summary_fp

            if dir_ == 'rev':
                !qiime sidle reverse-complement-sequence \
                 --i-sequence $rep_seq_fp \
                 --o-reverse-complement $rep_seq_fp

data/output/mock/5.regional_demux/table-fwd-69.qza
Saved FeatureTable[Frequency] to: data/output/mock/5.regional_demux/table-fwd-69.qza
Saved FeatureData[Sequence] to: data/output/mock/5.regional_demux/rep-seq-fwd-69.qza
Saved Visualization to: data/output/mock/5.regional_demux/table-fwd-69.qzv
data/output/mock/5.regional_demux/table-fwd-308.qza
Saved FeatureTable[Frequency] to: data/output/mock/5.regional_demux/table-fwd-308.qza
Saved FeatureData[Sequence] to: data/output/mock/5.regional_demux/rep-seq-fwd-308.qza
Saved Visualization to: data/output/mock/5.regional_demux/table-fwd-308.qzv
data/output/mock/5.regional_demux/table-fwd-516.qza
Saved FeatureTable[Frequency] to: data/output/mock/5.regional_demux/table-fwd-516.qza
Saved FeatureData[Sequence] to: data/output/mock/5.regional_demux/rep-seq-fwd-516.qza
Saved Visualization to: data/output/mock/5.regional_demux/table-fwd-516.qzv
data/output/mock/5.regi

## Accounting

Finally, I'd like to determine how many sequences were lost and where they were lost. For this, I need the dada2 stats... 

In [29]:
dada2_summaries = Artifact.load(
    'data/output/mock/2.denoise/denosing_stats.qza')
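
Once the stats are loaded, the per-step losses are simple differences between consecutive count columns. The sketch below works on a plain dictionary with made-up numbers. The column names `input`, `filtered`, `denoised` and `non-chimeric` are my assumption about the DADA2 stats layout and should be checked against the actual artifact; `step_losses` is a hypothetical helper.

```python
# Ordered read-count columns (assumed names; verify against the artifact).
STEPS = ['input', 'filtered', 'denoised', 'non-chimeric']


def step_losses(counts, steps=STEPS):
    """Returns the reads lost at each step and the fraction finally kept."""
    losses = {
        f'{before}->{after}': counts[before] - counts[after]
        for before, after in zip(steps[:-1], steps[1:])
    }
    losses['fraction_retained'] = counts[steps[-1]] / counts[steps[0]]
    return losses


# Made-up counts for one sample, for illustration only.
example = {'input': 1000, 'filtered': 800, 'denoised': 760,
           'non-chimeric': 700}
summary = step_losses(example)
```

The same function could be extended with the counts that survive orientation, alignment and regional splitting to cover the later steps.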


# Reconstruction

## Database

We're going to start by pre-filtering the database. For reasons I'm not quite sure I understand, it's not possible to filter the aligned sequences directly, so I'm going to filter the reference sequences the way I want them, use *those* to filter the alignment set, and then generate a subset of aligned sequences I can use for analysis.
<!-- 
To extract the positions, I'm going to use the most abundant features from each region and align them against the full reference set. For this analysis, I want to work with Silva 128 [cite]. I picked 128 specifically to be able to do phylogenetic tree reconstruction, although at this point, the phylogenny doesns't really matter, so I guess it's more for consistency with the simulated data ¯\\\_(ツ)\_/¯. I think we've demonstrated the use o fmultiple databases sucessfully, if not I can switch this to greengenes -->

In [12]:
if steps['prepare_database']['run']:
    prep_db_overwrite = steps['prepare_database']['overwrite']
    prep_db_input_seqs = steps['prepare_database']['input_seqs']
    prep_db_input_taxa = steps['prepare_database']['input_taxa']
    prep_db_input_alignment = steps['prepare_database']['input_alignment']
    prep_db_output = steps['prepare_database']['output_dir']
    prep_db_tmp_dir = steps['prepare_database']['tmp_dir']

    os.makedirs(prep_db_output, exist_ok=prep_db_overwrite)
    os.makedirs(prep_db_tmp_dir, exist_ok=True)

    !qiime rescript cull-seqs \
     --i-sequences $prep_db_input_seqs \
     --p-num-degenerates 6 \
     --o-clean-sequences $prep_db_tmp_dir/silva-128-99-low-degen.qza 

    !qiime taxa filter-seqs \
     --i-sequences $prep_db_tmp_dir/silva-128-99-low-degen.qza \
     --i-taxonomy $prep_db_input_taxa \
     --p-include 'D_0__Bact,D_0__Arch' \
     --o-filtered-sequences $prep_db_tmp_dir/silva-128-99-low-degen-bact-archea.qza

    !qiime feature-table filter-seqs \
     --i-data $prep_db_input_alignment \
     --m-metadata-file $prep_db_tmp_dir/silva-128-99-low-degen-bact-archea.qza \
     --o-filtered-data $prep_db_output/silva-128-99-aligned-5-degen-bact-archea-only.qza
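
The logic of the three commands above is: decide which reference IDs to keep using the unaligned sequences and taxonomy, then subset the alignment by those IDs. The sketch below shows that two-stage pattern on toy dictionaries. It is not how RESCRIPt or q2-taxa are implemented, and `max_degenerates` is a hypothetical parameter of this sketch whose meaning may differ from `--p-num-degenerates`.

```python
DNA = set('ACGT')
GAPS = set('-.')


def count_degenerates(seq):
    """Counts bases that are not A, C, G or T (gaps are ignored)."""
    return sum(1 for base in seq.upper()
               if base not in DNA and base not in GAPS)


def filter_alignment(unaligned, taxonomy, aligned,
                     max_degenerates, keep_terms):
    """Filters on the unaligned sequences, then subsets the alignment."""
    keep_ids = {
        seq_id for seq_id, seq in unaligned.items()
        if count_degenerates(seq) <= max_degenerates
        and any(term in taxonomy[seq_id] for term in keep_terms)
    }
    return {seq_id: seq for seq_id, seq in aligned.items()
            if seq_id in keep_ids}


# Toy reference: 'b' has too many Ns and 'c' is not bacterial or archaeal.
unaligned = {'a': 'ACGTN', 'b': 'ACNNNNNNN', 'c': 'ACGT'}
taxonomy = {'a': 'D_0__Bacteria;D_1__X', 'b': 'D_0__Bacteria',
            'c': 'D_0__Eukaryota'}
aligned = {'a': 'AC-GTN', 'b': 'ACNNNNNNN-', 'c': '-ACGT'}
filtered = filter_alignment(unaligned, taxonomy, aligned,
                            max_degenerates=5,
                            keep_terms=('D_0__Bact', 'D_0__Arch'))
```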

In [33]:
total_alignment = Artifact.load(f'{prep_db_output}/silva-128-99-aligned-5-degen-bact-archea-only.qza')

In [46]:
import daskjoin

In [44]:
# total_align_mas.consensus()

And then we'll subsample the reference to 10% to fit the data.
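
As a rough picture of what a seeded 10% subsample means, the sketch below draws a reproducible random subset of sequence IDs with the standard library. RESCRIPt may well sample differently, so this only illustrates why fixing the seed makes the subset repeatable. `subsample_ids` is a hypothetical helper.

```python
import random


def subsample_ids(ids, fraction, seed):
    """Draws a reproducible random subset containing `fraction` of the IDs."""
    ids = sorted(ids)  # fixed order so the seed fully determines the draw
    n_keep = int(round(len(ids) * fraction))
    return set(random.Random(seed).sample(ids, n_keep))


reference_ids = [f'ref{i}' for i in range(200)]
subset = subsample_ids(reference_ids, fraction=0.1, seed=1776)
repeat = subsample_ids(reference_ids, fraction=0.1, seed=1776)
```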

In [13]:
if steps['subsample_db']['run']:
    subsample_db_overwrite = steps['subsample_db']['overwrite']
    subsample_db_input_alignment = steps['subsample_db']['input_alignment']
    subsample_db_fraction = steps['subsample_db']['fraction_subsample']
    subsample_db_seed = steps['subsample_db']['seed']
    subsample_db_output_dir =  steps['subsample_db']['output_dir']
    
    os.makedirs(subsample_db_output_dir, exist_ok=subsample_db_overwrite)
    
    output_alignment = (f'{subsample_db_output_dir}silva-128-99-aligned-'
                        f'5-degen-bact-archea-only-{subsample_db_fraction}.qza')
    
    !qiime rescript subsample-fasta \
     --i-sequences $subsample_db_input_alignment \
     --p-random-seed $subsample_db_seed \
     --p-subsample-size $subsample_db_fraction \
     --o-sample-sequences $output_alignment

Saved FeatureData[AlignedSequence] to: data/output/mock/6.database_alignment/silva-128-99-aligned-5-degen-bact-archea-only-0.1.qza


# Perform regional alignment

In [None]:
steps['align_regions_for_sub_db']

{'run': True,
 'overwrite': True,
 'input_db_dir': 'data/output/mock/6.database_alignment/',
 'input_asv_dir': 'data/output/mock/5.regional_demux',
 'output_dir': 'data/output/mock/7.regional_alignment',
 'abundance_thresh': 1000}

In [18]:
align_regions_db_input = steps['align_regions_for_sub_db']['input_db_dir']
align_asv_input = steps['align_regions_for_sub_db']['input_asv_dir']

align_regions_overwrite = steps['align_regions_for_sub_db']['overwrite']
align_asv_out = steps['align_regions_for_sub_db']['output_dir']
os.makedirs(align_asv_out, exist_ok=align_regions_overwrite)

ref_align = f'{align_regions_db_input}/silva-128-99-aligned-5-degen-bact-archea-only-0.1.qza'

dir_ = 'fwd'
pos = 308
thresh = steps['align_regions_for_sub_db']['abundance_thresh']
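
The cell above pins a single region (`fwd`, position 308). To repeat the step for every region, the per-region paths can be generated from the starting positions chosen earlier. This is a sketch of that bookkeeping only; `region_paths` is a hypothetical helper, and the file naming follows the pattern used by the regional demultiplexing step.

```python
import os

# Starting positions copied from the regional demultiplexing step.
first_positions = {'fwd': [69, 308, 516, 926, 1054, 1301],
                   'rev': [340, 541, 815, 1153, 1313, 1466]}


def region_paths(asv_dir, out_dir, positions):
    """Yields the input and output paths for every direction/position pair."""
    for dir_, starts in positions.items():
        for pos in starts:
            yield {
                'label': f'{dir_}-{pos}',
                'table': os.path.join(asv_dir, f'table-{dir_}-{pos}.qza'),
                'rep_seq': os.path.join(asv_dir,
                                        f'rep-seq-{dir_}-{pos}.qza'),
                'region_dir': os.path.join(out_dir, f'{dir_}-{pos}'),
            }


regions = list(region_paths('data/output/mock/5.regional_demux',
                            'data/output/mock/7.regional_alignment',
                            first_positions))
```

Each dictionary in `regions` carries what one pass of the filtering cell needs, so the hard-coded `dir_` and `pos` could be replaced by a loop over this list.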

In [20]:
region_dir = f'{align_asv_out}/{dir_}-{pos}'
os.makedirs(region_dir, exist_ok=align_regions_overwrite)

full_table_fp = f'{align_asv_input}/table-{dir_}-{pos}.qza'
full_repseq_fp = f'{align_asv_input}/rep-seq-{dir_}-{pos}.qza'

!qiime feature-table filter-features \
 --i-table $full_table_fp \
 --p-min-frequency $thresh \
 --o-filtered-table $region_dir/table.qza

!qiime feature-table filter-seqs \
 --i-data $full_repseq_fp \
 --i-table $region_dir/table.qza \
 --o-filtered-data $region_dir/rep-seq.qza

Saved FeatureTable[Frequency] to: data/output/mock/7.regional_alignment/fwd-308/table.qza
Saved FeatureData[Sequence] to: data/output/mock/7.regional_alignment/fwd-308/rep-seq.qza


And then, I want to perform local regional alignment against the reference for each region.

In [2]:
!qiime sidle

Usage: qiime sidle [OPTIONS] COMMAND [ARGS]...

  Description: This plugin reconstructs a full 16s sequence from short reads
  over a marker gene region using the Short MUltiple Read Framework (SMURF)
  algorithm.

  Plugin website: https://github.com/jwdebelius/q2-sidle

  Getting user support: Please post to the QIIME 2 forum for help with this
  plugin: https://forum.qiime2.org

Options:
  --version    Show the version and exit.
  --citations  Show citations and exit.
  --help       Show this message and exit.

Commands:
  align-regional-kmers            Aligns ASV representative sequences to a
                                  regional kmer database.

  find-alignment-span-positions   Finds the first and last positions of the
                                  representative sequences in the alignment

  find-and-prepare-regional-seqs  Extracts kmer sequences from aligned
                                 