**Author**: JW Debelius (justine.debelius@ki.se)<br>
**Date**: September 2021<br>
**Environment**: `qiime2-2021.8-dev`<br>
**Python version**: 3.8<br>
**Extra packages**: None<br>
**QIIME version**: 2021.8-dev<br>
**Extra Plugins**: q2-sidle (memory refactor); RESCRIPt (v. )

# Mock Community Data Preparation

## Background

This notebook will process four mock community samples sequenced with the Ion Torrent metagenomic kit, originally published by [Barb et al, 2015](https://pubmed.ncbi.nlm.nih.gov/26829716/). In the original paper, the authors profiled four mock communities using the proprietary 6-primer Ion Torrent kit. This kit uses 6 primer pairs to target 7 regions along the 16S gene, with both forward and reverse reads possible.
* V2
* V4
* V8
* V3
* V67
* V9

In the original paper, the authors compared the performance of each region against the baseline. They clustered sequences into de novo OTUs with UPARSE; taxonomic assignment was made in QIIME 1, and the regions were then compared using abundance profiling and deviation from the published Shannon diversity.

## Data download and availability

Sequences were deposited in SRA under accession SUB1054354. We used the [SRA CLI tools]() to download the sequences and the provided sample sheet. Sequence fastq files and the description were saved in the `mock` folder in the `data/input` directory.

When they were deposited, the sequences were demultiplexed by sample, but not by region. The Ion Torrent kit produced reads for 12 regions (6 forward and 6 reverse), meaning that to be able to use the sequences here, we need to split them into regions.

## Preprocessing Approach

For the sake of not hating everyone and everything in existence, we will take a somewhat less optimal approach and try to do a regional demux on already denoised sequences, because otherwise both I and my computer will cry.
So, sequences are prepared through the following steps:

1. Sequences from each sample are split into batches based on read length. For this, we'll use 150, 200, 250, and 300 nt reads.
2. Import the reads for each length using the manifest format
3. Denoise the sequences in each batch to their fixed read length using dada2-pyro, which better handles the error profile. Trim the reads to the appropriate read length during denoising.
4. Use RESCRIPt to orient the sequences so they have a consistent orientation
5. Align the reads against the reference sequences to separate them into regions
6. Identify the regions through some kind of witchcraft
7. Filter the sequences into samples and regions
8. Account for all the sequences that were lost during filtering and aligning
<!-- 9. Consider life choices and decide why there is such an obsession with mock communitites -->

## Set up

In [10]:
import os

import biom
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import skbio

%matplotlib inline

from qiime2 import Artifact, Metadata, Visualization

In [11]:
input_dir = 'data/inputs/mock/' # modify if you've placed the files in another location
output_dir = 'data/output/mock/' # Change this if you want a different output

In [12]:
read_lengths = np.array([200])
# read_lengths = np.array([200])

In [13]:
steps = {
    'split_by_length': {
        'run': True,
        'overwrite': True,
        'input_dir': input_dir,
        'output_dir': os.path.join(output_dir, '1.split_length')
    },
    'import_seqs': {
        'run': True,
        'overwrite': True,
        'input_dir': os.path.join(output_dir, '1.split_length'),
        'output_dir': os.path.join(output_dir, '2.split_manifest'),
    },
    'denoise_seqs': {
        'run': True,
        'overwrite': True,
        'input_dir':  os.path.join(output_dir, '2.split_manifest'),
        'output_dir':  os.path.join(output_dir, '3.denoised'),
    },
    'cluster_orientation': {
        'run': True,
        'overwrite': True,
        'input_dir': os.path.join(output_dir, '3.denoised'),
        'output_dir': os.path.join(output_dir, '4.split_orientation'),
        'references': {'fwd': 'data/reference/gg_13_8_88/88_otus.qza',
                       'rev': 'data/reference/gg_13_8_88/88_otus_rc.qza'},
    },
    'generate_alignment_reference': {
        'run': True,
        'input_refs': {
            'fwd': 'data/reference/gg_13_8_88/gg_88_otus_aligned.qza',
            'rev': 'data/reference/gg_13_8_88/gg_88_otus_aligned_rc.qza',
            },
        'taxonomy_fp': 'data/reference/gg_13_8_88/88_otu_taxonomy.qza',
        'output_refs': {
            'fwd': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza',
            'rev': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_rev.qza',
        },
        'keep_group': 'f__Enterobacteriaceae',
    },
    'align_to_reference': {
        'run': True,
        'overwrite': True,
        'input_dir':  os.path.join(output_dir, '4.split_orientation'),
        'output_dir': os.path.join(output_dir, '5.aligned_to_ref'),
        'reference_groups': 'f__Enterobacteriaceae',
        'references': {'fwd':'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza',
                       'rev': 'data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_rev.qza',
                       },
    },
    'extract_positions': {
        'run': True,
        'overwrite': True,
        'input_align_dir': os.path.join(output_dir, '5.aligned_to_ref'),
        'input_rep_seq_dir': os.path.join(output_dir,  '4.split_orientation'),
        'input_table_dir': os.path.join(output_dir, '3.denoised'),
        'output_dir': os.path.join(output_dir, '6.regional_identification'),
    },
    'split_to_region': {
        'run': True,
        'overwrite': True,
        'input_map_dir': os.path.join(output_dir, '6.regional_identification'),
        'input_data_dir': os.path.join(output_dir, '3.denoised'),
        'output_dir': os.path.join(output_dir, '7.regional_demux'),
    }
}
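
Every entry in `steps` follows the same convention: a `run` flag, an `overwrite` flag, and the input and output locations. The sketch below shows one way those flags are consumed by the later cells. The `prepare_step` helper and the temporary directory are hypothetical and exist only for illustration.

```python
import os
import tempfile


def prepare_step(step):
    """Creates the step's output directory if the step is enabled.

    Hypothetical helper: mirrors the pattern used by the cells in this
    notebook, where `overwrite` is passed to `exist_ok`.
    """
    if not step['run']:
        return False
    # With overwrite=False, an existing directory raises FileExistsError
    os.makedirs(step['output_dir'], exist_ok=step['overwrite'])
    return True


with tempfile.TemporaryDirectory() as tmp:
    demo_step = {'run': True, 'overwrite': False,
                 'output_dir': os.path.join(tmp, '1.split_length')}
    first_call = prepare_step(demo_step)       # directory is created
    try:
        prepare_step(demo_step)                # directory already exists
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True
```

Note that `exist_ok=True` only suppresses the error for an existing directory; it does not clear files left over from a previous run.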

In [14]:
aligned_ref = Artifact.load('data/reference/gg_13_8_88/gg_88_otus_aligned.qza').view(pd.Series)
aligned_rev = aligned_ref.apply(lambda x: x.reverse_complement())
aligned_rev = Artifact.import_data('FeatureData[AlignedSequence]', aligned_rev)
aligned_rev.save('data/reference/gg_13_8_88/gg_88_otus_aligned_rc.qza')

'data/reference/gg_13_8_88/gg_88_otus_aligned_rc.qza'

In [15]:
Artifact.load('data/reference/gg_13_8_88/gg_88_otus_aligned.qza')

<artifact: FeatureData[AlignedSequence] uuid: f87e33c2-8105-49ea-82db-e9d0afc26a99>

In [16]:
samples = [fp.split('.')[0] for fp in os.listdir(input_dir)
           if (os.path.splitext(fp)[1] == '.fastq') and ('_1' in fp)]

In [17]:
samples

['SRR2182221_1', 'SRR2182220_1', 'SRR11180057_1', 'SRR2182222_1']

## Preprocessing

### Get preprocessing references

...|

### Split sequences into read lengths

We'll start by splitting sequences into batches based on the read lengths. For each sample,
we'll load the sequences, determine the read lengths, and then group the sequences according to the read lengths provided. 
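
The batching rule can be tried on a handful of made-up read lengths before running it on real data. This is a minimal sketch using the same count-and-replace logic as the function in this section; the thresholds, read IDs, and lengths are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical thresholds and read lengths, for illustration only
demo_thresholds = np.array([150, 200, 250, 300])
demo_lengths = pd.Series([120, 180, 200, 230, 310],
                         index=['r1', 'r2', 'r3', 'r4', 'r5'])

# Counts how many thresholds each read is strictly longer than
n_exceeded = pd.concat(
    axis=1,
    objs=[(demo_lengths > t) * 1 for t in demo_thresholds]
).sum(axis=1)

# Maps the count back to a threshold; a count of 0 means the read is too
# short for any batch
demo_batch = n_exceeded.replace(
    {i + 1: int(t) for i, t in enumerate(demo_thresholds)}
)
print(demo_batch.to_dict())
```

Because the comparison is strict, a read of exactly 200 nt lands in the 150 batch rather than the 200 batch, and reads shorter than the smallest threshold keep a batch label of 0.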

In [18]:
def batch_sequence_by_length(sample, ori_fastq, output_dir, 
                             read_lengths=read_lengths):
    """
    Splits sequences into batches based on read length
    """
    # Loads the file
    seqs_with_qual = skbio.io.read(ori_fastq, 
                               format='fastq', 
                               phred_offset=33,
                               )
    # Records each sequence and its length, indexed by sequence ID
    seq_lengths = pd.DataFrame.from_dict(orient='index', data={
        seq.metadata['id']: {'seq': seq,
                             'length': len(seq),
                             }
        for seq in seqs_with_qual
    })
    # Groups the sequences into batches based on the read lengths: counts
    # how many thresholds each read exceeds, then maps that count back to
    # the largest threshold exceeded (0 means no threshold was exceeded)
    seq_group = pd.concat(axis=1, objs=[(seq_lengths['length'] > (length)) * 1 
                          for length in read_lengths]).sum(axis=1)
    seq_group.replace({i + 1: length for i, length in enumerate(read_lengths)},
                       inplace=True)
    seq_lengths['batch'] = seq_group
    
    seq_batches = \
        seq_lengths.groupby('batch', sort=False)['seq'].apply(lambda x: x.values)
    seq_batches = seq_batches.loc[read_lengths]
    
    # Writes each batch to its own fastq file, closing the handle when done
    for length, reads in seq_batches.items():
        fp_ = f'{output_dir}/{length}.fastq'
        with skbio.io.open(fp_, 'w') as f_:
            for seq in reads:
                seq.write(f_, format='fastq', phred_offset=33)
            
    return seq_lengths[['length', 'batch']]

In [None]:
# if steps['split_by_length']['run']: 

#     step_overwrite = steps['split_by_length']['overwrite']
#     step_output_dir = steps['split_by_length']['output_dir']

#     os.makedirs(step_output_dir, exist_ok=step_overwrite)
    
#     seq_length_summary = dict()

#     for sample in samples:
#         print(sample)

#         # Sets up the sample path and output directory
#         ori_fastq = os.path.join(input_dir, f'{sample}.fastq')
#         sample_dir = os.path.join(step_output_dir, sample)
#         os.makedirs(sample_dir, exist_ok=step_overwrite)

#         # Batches the sequences 
#         seq_lengths = \
#             batch_sequence_by_length(sample, ori_fastq, sample_dir, read_lengths)

#         seq_length_summary[sample] = seq_lengths

### Import the data into QIIME 2 using a manifest format

Having split the data from each sample by read length, we'll build a manifest for each read length and use it to import that batch into QIIME 2.
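
For reference, a single-end manifest is a tab-separated file with a `sample-id` column and an `absolute-filepath` column. The sketch below writes one with pandas alone so the layout can be inspected without QIIME 2; the sample IDs and the temporary directory are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample IDs and directory layout, for illustration only
demo_samples = ['sampleA_1', 'sampleB_1']
demo_length = 200

with tempfile.TemporaryDirectory() as tmp:
    demo_manifest = pd.DataFrame.from_dict(orient='index', data={
        sample: {'absolute-filepath':
                 os.path.abspath(f'{tmp}/{sample}/{demo_length}.fastq')}
        for sample in demo_samples
    })
    demo_manifest.index.set_names('sample-id', inplace=True)

    # Writes the manifest and reads back its header line
    manifest_path = os.path.join(tmp, f'manifest_{demo_length}.tsv')
    demo_manifest.to_csv(manifest_path, sep='\t')
    with open(manifest_path) as f_:
        header = f_.readline().strip().split('\t')
```

Here `header` holds the two column names in the order QIIME 2 expects for `SingleEndFastqManifestPhred33V2`.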

In [19]:
def build_manifest(samples, read_length):
    """
    Builds a sample manifest for a specified read length
    """
    manifest = pd.DataFrame.from_dict(orient='index', data={
        sample: {
            'absolute-filepath': os.path.abspath(f'{manifest_input_dir}/'
                                                 f'{sample}/{read_length}.fastq')
        }
        for sample in samples
    })
    manifest.index.set_names('sample-id', inplace=True)
    return Metadata(manifest)

In [20]:
if steps['import_seqs']['run']:
    manifest_input_dir = steps['import_seqs']['input_dir']
    manifest_output_dir = steps['import_seqs']['output_dir']
    manifest_overwrite = steps['import_seqs']['overwrite']

    os.makedirs(manifest_output_dir, exist_ok=manifest_overwrite)

    for read_length in read_lengths:
        manifest_fp = f'{manifest_output_dir}/manifest_{read_length}.tsv'
        seqs_art_fp = f'{manifest_output_dir}/demux_reads_{read_length}.qza'
        seqs_vis_fp = f'{manifest_output_dir}/demux_reads_{read_length}.qzv'

        manifest = build_manifest([samples[0]], read_length)
        manifest.save(manifest_fp)

        !qiime tools import \
          --type 'SampleData[SequencesWithQuality]' \
          --input-path $manifest_fp \
          --output-path $seqs_art_fp \
          --input-format SingleEndFastqManifestPhred33V2

        !qiime demux summarize \
         --i-data $seqs_art_fp \
         --o-visualization $seqs_vis_fp


### Denoise sequences

The recommendation for Ion Torrent sequencing is to denoise using `dada2 denoise-pyro` **[citation needed]**, so we'll follow that.

In [None]:
if steps['denoise_seqs']['run']:
    denoised_input = steps['denoise_seqs']['input_dir']
    denoised_output = steps['denoise_seqs']['output_dir']
    denoised_overwrite = steps['denoise_seqs']['overwrite']
    os.makedirs(denoised_output, exist_ok=denoised_overwrite)

    for read_length in read_lengths:
        seqs_art_fp = f'{denoised_input}/demux_reads_{read_length}.qza'
        table_art_fp = f'{denoised_output}/table_{read_length}.qza'
        rep_seq_art_fp = f'{denoised_output}/rep_seq_{read_length}.qza'
        stats_art_fp = f'{denoised_output}/denosing_stats_{read_length}.qza'
        table_viz_fp = f'{denoised_output}/table_{read_length}.qzv'
        stats_viz_fp = f'{denoised_output}/denosing_stats_{read_length}.qzv'

        !qiime dada2 denoise-pyro \
         --i-demultiplexed-seqs $seqs_art_fp \
         --p-trunc-len $read_length \
         --p-hashed-feature-ids \
         --o-table $table_art_fp \
         --o-representative-sequences $rep_seq_art_fp \
         --o-denoising-stats $stats_art_fp

        !qiime metadata tabulate \
         --m-input-file $stats_art_fp \
         --o-visualization $stats_viz_fp

        !qiime feature-table summarize \
         --i-table $table_art_fp \
         --o-visualization $table_viz_fp

Saved Visualization to: data/output/mock/3.denoised/denosing_stats_200.qzv


And now we have a set of fully denoised tables with a fixed read length and reads of mixed orientation and region. These are now ready for demultiplexing.

## Regional Demultiplexing


### Split the data into forward and reverse reads

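We'll split the data by orientation using closed-reference clustering in vsearch against the forward and reverse-complemented reference sequences.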
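
As a toy illustration of the orientation idea only (this is not how vsearch matches sequences, which tolerates mismatches), a read can be labelled by checking whether it, or its reverse complement, occurs in a reference. All sequences and function names below are hypothetical.

```python
# Toy illustration only: the notebook itself relies on vsearch
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(seq):
    """Returns the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def guess_orientation(read, reference):
    """Labels a read 'fwd', 'rev', or 'unknown' by exact substring match."""
    if read in reference:
        return 'fwd'
    if reverse_complement(read) in reference:
        return 'rev'
    return 'unknown'


demo_reference = 'ATGGCGTACGTTAGCCTAGGA'   # hypothetical reference
fwd_read = 'GCGTACGTTAG'                   # copied from the reference
rev_read = reverse_complement('TTAGCCTAGGA')

fwd_label = guess_orientation(fwd_read, demo_reference)
rev_label = guess_orientation(rev_read, demo_reference)
other_label = guess_orientation('AAAAAAAAAA', demo_reference)
```

Reverse-complementing the reference once, as done for the aligned reference earlier in this notebook, is equivalent to reverse-complementing every read and is far cheaper.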

In [None]:
if steps['cluster_orientation']['run']:
    orient_input_dir = steps['cluster_orientation']['input_dir']
    orient_output_dir = steps['cluster_orientation']['output_dir']
    orient_overwrite = steps['cluster_orientation']['overwrite']
    references =  steps['cluster_orientation']['references']

    os.makedirs(orient_output_dir, exist_ok=orient_overwrite)

    flip_dir = {"fwd": 'rev', 'rev': 'fwd'}

    for read_length in read_lengths:
        input_seqs = f'{orient_input_dir}/rep_seq_{read_length}.qza'
        input_table = f'{orient_input_dir}/table_{read_length}.qza'

        for dir_, ref_fp in references.items():
            dir2 = flip_dir[dir_]
            cluster_fp = f'{orient_output_dir}/clustered_{read_length}_{dir_}.qza'
            table_fp = f'{orient_output_dir}/table_{read_length}_{dir_}.qza'
            discard_fp = f'{orient_output_dir}/discard_{read_length}_{dir2}.qza'

            !qiime vsearch cluster-features-closed-reference \
             --i-sequences $input_seqs \
             --i-table $input_table \
             --i-reference-sequences $ref_fp \
             --p-perc-identity 0.85 \
             --p-strand plus \
             --o-clustered-table $table_fp \
             --o-clustered-sequences $cluster_fp \
             --o-unmatched-sequences $discard_fp 

### Generate sub alignment references

In [None]:
if steps['generate_alignment_reference']['run']: 
    input_refs = steps['generate_alignment_reference']['input_refs']
    output_refs = steps['generate_alignment_reference']['output_refs']
    taxonomy_fp = steps['generate_alignment_reference']['taxonomy_fp']
    keep_group = steps['generate_alignment_reference']['keep_group']

    for dir_, input_ in input_refs.items():
        output = output_refs[dir_]
        if not os.path.exists(output):
            !qiime taxa filter-seqs \
             --i-sequences $input_ \
             --i-taxonomy $taxonomy_fp \
             --p-include $keep_group \
             --o-filtered-sequences $output

### Align the oriented reference data

In [None]:
if steps['align_to_reference']['run']: 
    align_input_dir = steps['align_to_reference']['input_dir']
    align_output_dir = steps['align_to_reference']['output_dir']
    align_overwrite = steps['align_to_reference']['overwrite']
    align_references =  steps['align_to_reference']['references']

    os.makedirs(align_output_dir, exist_ok=align_overwrite)

    for length in read_lengths:
        for dir_, ref_alignment_fp in align_references.items():
            rep_seq_fp = f'{align_input_dir}/discard_{length}_{dir_}.qza'
            aligned_output = (f'{align_output_dir}/repset_aligned._{length}_'
                              f'{dir_}.qza')
            !qiime alignment mafft-add \
             --i-alignment $ref_alignment_fp \
             --i-sequences $rep_seq_fp \
             --o-expanded-alignment $aligned_output

In [None]:
!qiime alignment mafft-add --help

In [None]:
align_references
# seqs = Artifact.load('data/reference/gg_13_8_88/gg_13_8_aligned_enterobacteraceae_fwd.qza').view(DN

### Extracts the starting position from the alignment

In [None]:
if steps['extract_positions']['run']:
    first_pos_align_dir = steps['extract_positions']['input_align_dir']
    first_pos_repseq_dir = steps['extract_positions']['input_rep_seq_dir']
    first_pos_table_dir_ = steps['extract_positions']['input_table_dir']
    first_pos_output_dir =  steps['extract_positions']['output_dir']
    first_pos_ovewrite = steps['extract_positions']['overwrite']

    os.makedirs(first_pos_output_dir, exist_ok=first_pos_ovewrite)

    for length in read_lengths:
        for dir_ in ['fwd', 'rev']:
            alignment_fp = \
                f'{first_pos_align_dir}/repset_aligned._{length}_{dir_}.qza'
            rep_seq_fp = \
                f'{first_pos_repseq_dir}/discard_{length}_{dir_}.qza'
            table_fp = f'{first_pos_table_dir_}/table_{length}.qza'
            output_pos_art = \
                f'{first_pos_output_dir}/starts-{length}-{dir_}.qza'
            output_pos_viz = \
                f'{first_pos_output_dir}/starts-{length}-{dir_}.qzv'

            !qiime sidle find-first-alignment-position \
             --i-alignment $alignment_fp \
             --i-representative-sequences $rep_seq_fp \
             --i-table $table_fp \
             --p-direction $dir_ \
             --o-position-summary $output_pos_art 

            !qiime sidle summarize-alignment-positions \
              --i-alignment $alignment_fp \
              --i-position-summary $output_pos_art \
              --p-sort-cols 'starting-position,sequence-counts' \
              --p-weight-by-abundance \
              --p-colormap 'viridis' \
              --p-heatmap-maskcolor 'k' \
              --o-visualization $output_pos_viz
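
To make the notion of a starting position concrete, the sketch below finds the first non-gap column of an aligned read, scanning from the left for forward reads and from the right for reverse reads. This is an assumption about what the position means, not a description of the q2-sidle implementation; the helper name, the 0-based indexing, and the example row are all hypothetical.

```python
def first_aligned_position(aligned_seq, direction='fwd', gap_chars='-.'):
    """Returns the 0-based alignment column where a read starts.

    Hypothetical helper: forward reads are scanned from the left and
    reverse reads from the right. Returns None if the row is all gaps.
    """
    columns = range(len(aligned_seq))
    if direction == 'rev':
        columns = reversed(columns)
    for col in columns:
        if aligned_seq[col] not in gap_chars:
            return col
    return None


demo_row = '----ACGT--AC----'   # hypothetical aligned read
fwd_start = first_aligned_position(demo_row, 'fwd')
rev_start = first_aligned_position(demo_row, 'rev')
empty_start = first_aligned_position('------', 'fwd')
```

Under this reading, reads amplified by the same primer should pile up on the same column, which is what makes the position usable as a region label.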

### Regional Demultiplexing 

Based on the visualization, I can infer a set of starting positions. For the forward reads, I looked at features with at least 2 sequences and a maximum relative abundance of at least 1000 sequences. This gives me starting positions of 69, 303, 508, 914, 1043, and 1285 for the forward reads. The reverse positions are a little harder because of the weird split in that block around 400. But we also read the starting position backward, so maybe it's not so weird? The reverse reads end up at 349, 534, 805, 1134, 1302, and 1455.
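
One possible way to shortlist candidate positions programmatically, rather than by eye, is to sum the sequence counts per starting position and keep the well-supported ones. This is only a sketch: the column names follow the `--p-sort-cols` values used above, while the table contents, the `min_count` threshold, and the per-position summing rule are assumptions.

```python
import pandas as pd

# Hypothetical position summary; the numbers are made up
demo_summary = pd.DataFrame(
    data={'starting-position': [69, 69, 70, 303, 303, 508],
          'sequence-counts': [5200, 1800, 40, 9100, 15, 2500]},
    index=[f'feature{i}' for i in range(6)],
)
min_count = 1000

# Sums the counts of all features that share a starting position
per_position = \
    demo_summary.groupby('starting-position')['sequence-counts'].sum()
# Keeps only positions supported by enough sequences
kept_positions = per_position[per_position >= min_count].index.tolist()
```

Low-count positions next to a dominant one (70 next to 69 in this toy table) are dropped by the threshold, which is the same pattern the heatmap makes visible.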

In [None]:
first_positions = {'fwd': [ 69, 303, 508,  914, 1043, 1285],
                   'rev': [349, 534, 805, 1143, 1302, 1455]
                   }

I'll use those positions to split the data.

In [None]:
if steps['split_to_region']['run']: 
    region_demux_overwrite = steps['split_to_region']['overwrite']
    region_demux_meta_dir = steps['split_to_region']['input_map_dir']
    region_demux_data_dir = steps['split_to_region']['input_data_dir']
    region_demux_ouput_dir = steps['split_to_region']['output_dir']

    os.makedirs(region_demux_ouput_dir, exist_ok=region_demux_overwrite)

    for length in read_lengths:
        for dir_, positions in first_positions.items():
            for pos in positions:
                input_table_fp = f'{region_demux_data_dir}/table_{length}.qza'
                input_rep_seq_fp = \
                    f'{region_demux_data_dir}/rep_seq_{length}.qza'

                meta_fp = f'{region_demux_meta_dir}/starts-{length}-{dir_}.qza'

                table_fp = \
                    f'{region_demux_ouput_dir}/table-{length}-{dir_}-{pos}.qza'
                rep_seq_fp = \
                    f'{region_demux_ouput_dir}/rep-seq-{length}-{dir_}-{pos}.qza'
                table_summary_fp = \
                     f'{region_demux_ouput_dir}/table-{length}-{dir_}-{pos}.qzv'

                where = f'[starting-position]="{pos}"'

                !qiime feature-table filter-features \
                 --i-table $input_table_fp \
                 --m-metadata-file $meta_fp \
                 --p-where $where \
                 --o-filtered-table $table_fp

                !qiime feature-table filter-seqs \
                 --i-data $input_rep_seq_fp \
                 --i-table $table_fp \
                 --o-filtered-data $rep_seq_fp
                
                !qiime feature-table summarize \
                 --i-table $table_fp \
                 --o-visualization $table_summary_fp
                
                if dir_ == 'rev':
                    !qiime sidle reverse-complement-sequence \
                     --i-sequence $rep_seq_fp \
                     --o-reverse-complement $rep_seq_fp

## Accounting

Finally, I'd like to determine how many sequences were lost and where they were lost. For this, I need the dada2 stats... 

In [86]:
read_counts = pd.DataFrame({
    sample: counts['batch'].value_counts()
    for sample, counts in seq_length_summary.items()
})
read_counts.index.set_names('read_length', inplace=True)
read_counts.columns.set_names('sample-id', inplace=True)
read_counts = read_counts.T
read_counts

dropped_reads = read_counts.drop(columns=read_lengths)

In [87]:
dada2_summaries = {
    length:  Artifact.load(f'data/output/mock/3.denoised/denosing_stats_{length}.qza')
    for length in read_lengths
}
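
Once the stats are in a DataFrame, the per-stage losses fall out of a column-wise difference. The sketch below uses a made-up table; the column names (`input`, `filtered`, `denoised`, `non-chimeric`) are assumed to match the DADA2 denoising stats, and the sample IDs and counts are hypothetical.

```python
import pandas as pd

# Hypothetical denoising stats; the counts are made up
demo_stats = pd.DataFrame(
    data={'input': [10000, 8000],
          'filtered': [9000, 7600],
          'denoised': [8800, 7500],
          'non-chimeric': [8500, 7400]},
    index=pd.Index(['sampleA_1', 'sampleB_1'], name='sample-id'),
)

# Reads lost between consecutive stages (a negative diff is a loss)
demo_lost = -demo_stats.diff(axis=1).drop(columns='input')
demo_lost.columns = [f'lost_at_{c}' for c in demo_lost.columns]

# Fraction of input reads that survive every stage
demo_retained = demo_stats['non-chimeric'] / demo_stats['input']
```

The same pattern extends to the later steps: append the per-sample counts after orientation and regional demultiplexing as extra columns and the difference shows where each read was dropped.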

# Reconstruction

## Database

To extract the positions, I'm going to use the most abundant features from each region and align them against the full reference set. For this analysis, I want to work with Silva 128 [cite]. I picked 128 specifically to be able to do phylogenetic tree reconstruction, although at this point, the phylogeny doesn't really matter, so I guess it's more for consistency with the simulated data ¯\\\_(ツ)\_/¯. I think we've demonstrated the use of multiple databases successfully; if not, I can switch this to Greengenes.

In [9]:
reference_dir = 'data/reference/silva/'

In [11]:
reference_alignment = os.path.join(reference_dir, 'silva-128-99-aligned-seqs.qza')

# Filters the reference alignment

In [3]:
!qiime rescript

Usage: qiime rescript [OPTIONS] COMMAND [ARGS]...

  Description: Reference sequence annotation and curation pipeline.

  Plugin website: https://github.com/nbokulich/RESCRIPt

  Getting user support: Please post to the QIIME 2 forum for help with this
  plugin: https://forum.qiime2.org

Options:
  --version    Show the version and exit.
  --citations  Show citations and exit.
  --help       Show this message and exit.

Commands:
  cull-seqs                    Removes sequences that contain at least the
                               specified number of degenerate bases and/or
                               homopolymers of a given length.

  degap-seqs                   Remove gaps from DNA sequence alignments.
  dereplicate                  Dereplicate features with matching sequences
                               and taxonomies.

  edit-taxonomy                Edit taxonomy strings with find and 

In [31]:
dir_ = 'fwd'
pos = 303
thresh = 1000
length = 200

In [45]:
top_feat_output_dir = 'data/output/mock/8.top-features'
top_feat_input_dir = 'data/output/mock/7.regional_demux/'
os.makedirs(top_feat_output_dir, exist_ok=True)

In [46]:
full_table_fp = f'{top_feat_input_dir}/table-{length}-{dir_}-{pos}.qza'
full_repseq_fp = f'{top_feat_input_dir}/rep-seq-{length}-{dir_}-{pos}.qza'
filt_table_fp = f'{top_feat_output_dir}/table-{length}-{dir_}-{pos}-{thresh}.qza'
filt_repseq_fp = f'{top_feat_output_dir}/rep-seq-{length}-{dir_}-{pos}-{thresh}.qza'

!qiime feature-table filter-features \
 --i-table $full_table_fp \
 --p-min-frequency $thresh \
 --o-filtered-table $filt_table_fp

!qiime feature-table filter-seqs \
 --i-data $full_repseq_fp \
 --i-table $filt_table_fp \
 --o-filtered-data $filt_repseq_fp

Saved FeatureTable[Frequency] to: data/output/mock/8.top-features/table-200-fwd-303-1000.qza
Saved FeatureData[Sequence] to: data/output/mock/8.top-features/rep-seq-200-fwd-303-1000.qza


Aligns the filtered sequences against the reference

In [47]:
!qiime alignment mafft-add \
 --i-alignment $reference_alignment \
 --i-sequences  $filt_repseq_fp \
 --p-addfragments \
 --o-expanded-alignment $top_feat_output_dir/silva-128-99-aligned-200-fwd-300-100-add.qza

Plugin error from alignment:

  Command '['mafft', '--preservecase', '--inputorder', '--thread', '1', '--addfragments', '/var/folders/bw/q064ds0d2795_6mxnrssf0l1gkw0rj/T/qiime2-archive-0s3ez_d2/de0bd300-82bc-45be-81fc-fb47108111c4/data/dna-sequences.fasta', '/var/folders/bw/q064ds0d2795_6mxnrssf0l1gkw0rj/T/qiime2-archive-2hhw1_ya/20a7cc70-0cff-46b3-a0a1-606dbd5b3147/data/aligned-dna-sequences.fasta']' returned non-zero exit status 1.

Debug info has been saved to /var/folders/bw/q064ds0d2795_6mxnrssf0l1gkw0rj/T/qiime2-q2cli-err-3sfvad0e.log


In [None]:
!qiime alignment mafft-add \
 --i-alignment $reference_alignment \
 --i-sequences  $filt_repseq_fp \
 --p-no-addfragments \
 --o-expanded-alignment $top_feat_output_dir/silva-128-99-aligned-200-fwd-300-100-no-add.qza

In [2]:
!qiime rescript subsample-fasta --help

Usage: qiime rescript subsample-fasta [OPTIONS]

  Subsample a set of sequences (either plain or aligned DNA)based on a
  fraction of original sequences.

Inputs:
  --i-sequences ARTIFACT FeatureData[AlignedSequence¹ | Sequence²]
                          Sequences to subsample from.              [required]
Parameters:
  --p-subsample-size PROPORTION Range(0, 1, inclusive_start=False,
    inclusive_end=True)   Size of the random sample as a fraction of the
                          total count                           [default: 0.1]
  --p-random-seed INTEGER Seed to be used for random sampling.
    Range(1, None)                                                [default: 1]
Outputs:
  --o-sample-sequences ARTIFACT FeatureData[AlignedSequence¹ | Sequence²]
                          Sample of original sequences.             [required]