# Evaluating classifiers
* This pipeline shows how RESCRIPt can be used to evaluate and check the quality of reference databases.
* It follows the `evaluate-fit-classifier` method (for the alternative, `evaluate-cross-validate`, see the full <a href="https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494" target="_blank"> tutorial</a>).
* Note: this classifier was built for Tolentino et al. (in prep), which aims to identify the taxonomy of nanoplanktonic eukaryotes.

## Activating QIIME2 environment and installing relevant dependencies
Skip this step if you have already activated the QIIME 2 environment and installed the relevant dependencies.

In [None]:
# Install the relevant dependency (xmltodict) into the already-activated QIIME 2 environment
!conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict -y

<span style='color:red'>**Note:**</span> Import the QIIME 2 Python module to allow inline visualizations.

In [None]:
import qiime2 as q2

## Using RESCRIPt
This pipeline mainly uses QIIME 2, in particular the RESCRIPt plugin (REference Sequence annotation and CuRatIon Pipeline).

Documentation: <a href="https://github.com/bokulich-lab/RESCRIPt" target="_blank"> Github link </a>

In [None]:
# Install RESCRIPt source:
!pip install git+https://github.com/bokulich-lab/RESCRIPt.git
    
# View help menu for using rescript
!qiime dev refresh-cache
!qiime rescript --help

## Pipeline
Based on Mike Robeson's pipeline for formatting SILVA data 
(see <a href="https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494" target="_blank">Documentation</a>)

### Preparing SILVA database
**Using the easy way**

In [None]:
## Get the SILVA data
# Skip this cell if you have locally downloaded files

!qiime rescript get-silva-data\
    --p-version '138.1'\
    --p-target 'SSURef_NR99'\
    --p-include-species-labels\
    --o-silva-sequences silva-138.1-ssu-nr99-seqs.qza\
    --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

# If you have already downloaded the files, there is another way (see documentation)

**Using the hard way:**

Download SILVA files

First, we’ll need to go to the <a href="https://www.arb-silva.de/no_cache/download/archive/release_138.1/Exports/" target="_blank"> SILVA v138.1 archive </a> to obtain the following taxonomy files:

* tax_slv_ssu_138.1.txt.gz
* taxmap_slv_ssu_ref_nr_138.1.txt.gz
* tax_slv_ssu_138.1.tre.gz

and the sequence file:

* SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz

You can download the files directly through your browser. Here, we’ll use wget to download the files from the command line, then gunzip them prior to importing into QIIME 2.

Download all four files. The Taxonomy Rank file maps the taxonomic rank and taxonomy to the taxid.

In [None]:
# Taxonomy Rank File
!wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz
!gunzip tax_slv_ssu_138.1.txt.gz

# Taxonomy Map File
!wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz
!gunzip taxmap_slv_ssu_ref_nr_138.1.txt.gz

# Taxonomy Tree File
!wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz
!gunzip tax_slv_ssu_138.1.tre.gz

# SILVA NR99 sequences (non-redundant and unaligned)
!wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz
!gunzip SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz

***Import SILVA files into QIIME 2***

In [None]:
# Import Taxonomy Rank File
!qiime tools import \
    --type 'FeatureData[SILVATaxonomy]' \
    --input-path tax_slv_ssu_138.1.txt \
    --output-path taxranks-silva-138.1-ssu-nr99.qza

# Import Taxonomy Map File
!qiime tools import \
    --type 'FeatureData[SILVATaxidMap]' \
    --input-path taxmap_slv_ssu_ref_nr_138.1.txt \
    --output-path taxmap-silva-138.1-ssu-nr99.qza

# Import Taxonomy Tree File
!qiime tools import \
    --type 'Phylogeny[Rooted]' \
    --input-path tax_slv_ssu_138.1.tre \
    --output-path taxtree-silva-138.1-nr99.qza

# Import SILVA NR99 sequences (non-redundant and unaligned)
!qiime tools import \
    --type 'FeatureData[RNASequence]' \
    --input-path SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta \
    --output-path silva-138.1-ssu-nr99-seqs.qza

### Reverse Transcribe
It is not uncommon to find reference sequences only provided in RNA form, as is the case with the parsing of the SILVA reference data we performed earlier. What actually happened, “behind the scenes”, was that the SILVA sequences were reverse transcribed from RNA to DNA. Did you notice the FeatureData[RNASequence] type when we manually imported the sequences at the beginning? We’ll take that file and reverse transcribe manually here.

In [None]:
!qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-seqs.qza  \
    --o-dna-sequences silva-138.1-ssu-nr99-dna.qza
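
For nucleotide strings, the reverse transcription described above amounts to replacing every uracil (U) with thymine (T); the orientation of the sequence is unchanged. The block below is a minimal pure-Python sketch of that idea, not RESCRIPt's implementation. The function name `reverse_transcribe` and the toy record are hypothetical.

```python
def reverse_transcribe(rna: str) -> str:
    """Return the DNA form of an RNA string by swapping U for T (case preserved)."""
    return rna.translate(str.maketrans("Uu", "Tt"))


# Toy record for illustration only (not a real SILVA sequence)
toy_rna = "ACGUUGCAu"
toy_dna = reverse_transcribe(toy_rna)
print(toy_dna)
```

The real command operates on whole `.qza` artifacts, so this sketch is only useful for checking your understanding of what changes between the RNA and DNA files.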

We are now ready to proceed with making our SILVA reference database within QIIME 2. First, we’ll need to prepare the SILVA taxonomy prior to use. We’ll use parse-silva-taxonomy to do this, with the optional flag to include the species labels. But be wary, there are species label annotations that may be spurious! See the caveats about using species-labels later in this document.

In [None]:
# Default parsing
!qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
    --p-include-species-labels \
    --o-taxonomy silva-138.1-ssu-nr99-tax.qza

In [None]:
# Parse the taxonomy mapping to include specific taxonomic ranks (target: k__Animalia)
# NOTE: species labels are included via '--p-include-species-labels'
!qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree ~/analyses/ms398-2020/ref/taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map ~/analyses/ms398-2020/ref/taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks ~/analyses/ms398-2020/ref/taxranks-silva-138.1-ssu-nr99.qza \
    --p-include-species-labels \
    --p-ranks domain kingdom phylum class order family genus \
    --o-taxonomy silva-138.1-ssu-nr99-tax-kingdomspecies.qza

In [None]:
# List all ranks available in SILVA, then join them into the space-delimited
# string expected by --p-ranks
ranks = ['domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order', 'suborder', 'superfamily', 'family', 'subfamily', 'genus']
ranks_str = ' '.join(ranks)
print(ranks_str)

# This was done to check all the rank labels and to look for the label for Metazoa

In [None]:
# Parsing taxonomy mapping to include specific taxonomic rank (target: Metazoa)
!qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree ~/analyses/ms398-2020/ref/taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map ~/analyses/ms398-2020/ref/taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks ~/analyses/ms398-2020/ref/taxranks-silva-138.1-ssu-nr99.qza \
    --p-include-species-labels \
    --p-ranks domain kingdom phylum class order family genus \
    --o-taxonomy silva-138.1-ssu-nr99-tax-all.qza

# Note: k__Animalia was used instead to distinguish multicellular
# macroorganisms from microorganisms

In [None]:
# Excluded species labels
# Species labels were excluded to minimize processing time/requirements
!qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
    --p-no-include-species-labels \
    --o-taxonomy silva-138.1-ssu-nr99-nospecies-tax.qza

### “Culling” low-quality sequences with cull-seqs
Here we’ll remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.

In [None]:
!qiime rescript cull-seqs \
    --i-sequences ~/analyses/ms398-2020/ref/silva-138.1-ssu-nr99-dna.qza \
    --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
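
The culling rule described above (drop a sequence with 5 or more ambiguous bases, or with a homopolymer of 8 or more bases) can be sketched in plain Python. This is an illustrative sketch under those stated thresholds, not RESCRIPt's code. The function name `should_cull`, its parameter names, and the toy sequences are assumptions made for the example.

```python
import re

# IUPAC ambiguity (degenerate) codes for DNA
DEGENERATES = set("RYSWKMBDHVN")


def should_cull(seq: str, num_degenerates: int = 5, homopolymer_length: int = 8) -> bool:
    """Return True if the sequence fails either quality rule."""
    seq = seq.upper()
    n_ambiguous = sum(base in DEGENERATES for base in seq)
    if n_ambiguous >= num_degenerates:
        return True
    # (.)\1{n-1,} matches any character repeated at least n times in a row
    pattern = r"(.)\1{%d,}" % (homopolymer_length - 1)
    return re.search(pattern, seq) is not None


# Toy sequences for illustration only
toy = {
    "clean": "ACGT" * 5,
    "five_ambiguous": "ACGTNNNNNACGT",
    "four_ambiguous": "ACGTNNNNACGT",
    "homopolymer_8": "ACG" + "A" * 8 + "CGT",
    "homopolymer_7": "ACG" + "A" * 7 + "CGT",
}
culled = sorted(name for name, seq in toy.items() if should_cull(seq))
print(culled)
```

Running it on a handful of hand-made sequences like these is a quick way to see which records sit just above or just below the default thresholds.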

### Filtering sequences by length and taxonomy
Rather than blindly filter all of the reference sequences below a certain length, we’ll differentially filter based on the taxonomy of the reference sequence. The reason: if we remove every sequence below 1000 or 1200 bp, many of the reference sequences associated with Archaea (and some Bacteria) will be lost. Conversely, a single lower cutoff would retain shorter and lower-quality Bacterial or Eukaryal sequences. Either choice causes undue database selection bias. So, we’ll attempt to mitigate these issues by differentially filtering based on length. We will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and Eukaryota (18S) >= 1400 bp. See the help text for more info.

In [None]:
# Length filtering using the taxonomy with species labels
!qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax-all.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza 

In [None]:
# Length filtering using the taxonomy without species labels
!qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-nospecies-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-nospecies-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-nospecies-seqs-discard.qza 
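
The differential length filter used in the two cells above pairs each taxonomic label with its own minimum length. A minimal sketch of that logic follows. It assumes that the first matching label decides the threshold and that sequences matching no label are kept; both are assumptions of this sketch rather than confirmed RESCRIPt behavior. `MIN_LENS`, `passes_length_filter`, and the toy records are hypothetical.

```python
# Label -> minimum length, mirroring --p-labels and --p-min-lens above
MIN_LENS = {"Archaea": 900, "Bacteria": 1200, "Eukaryota": 1400}


def passes_length_filter(taxonomy: str, length: int, min_lens=MIN_LENS) -> bool:
    """Apply the threshold of the first label found in the taxonomy string."""
    for label, min_len in min_lens.items():
        if label in taxonomy:
            return length >= min_len
    # Assumption: sequences that match no label are retained
    return True


# Toy records (taxonomy string, sequence length) for illustration only
records = {
    "arch_950": ("d__Archaea; p__Crenarchaeota", 950),
    "bact_950": ("d__Bacteria; p__Proteobacteria", 950),
    "euk_1500": ("d__Eukaryota; p__Ciliophora", 1500),
    "euk_1399": ("d__Eukaryota; p__Ciliophora", 1399),
}
kept = sorted(fid for fid, (tax, n) in records.items() if passes_length_filter(tax, n))
print(kept)
```

Note how a 950 bp record survives when it is archaeal but is discarded when it is bacterial, which is exactly the bias a single global cutoff cannot avoid.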

### Filter to include Eukaryotes only
Further filtering can substantially reduce the size of the data and, hence, the computational requirements. Here, we keep only eukaryotic sequences.

In [None]:
# Filter the sequence file to include euks only
!qiime taxa filter-seqs\
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax-all.qza \
    --p-include Eukaryota \
    --o-filtered-sequences silva-138.1-seqs-euk.qza

In [None]:
# Filter the taxonomy mapping file to include euks only
!qiime rescript filter-taxa \
    --i-taxonomy silva-138.1-ssu-nr99-tax-all.qza \
    --p-include Eukaryota \
    --o-filtered-taxonomy silva-138.1-tax-euk.qza
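
The two cells above apply the same search term to the sequences and to the taxonomy so that both stay restricted to eukaryotes. The idea can be sketched as below, assuming a case-insensitive substring match on the taxonomy string (an assumption of this sketch). `filter_by_taxon` and the toy dictionaries are hypothetical.

```python
def filter_by_taxon(taxonomy: dict, sequences: dict, include: str):
    """Keep only features whose taxonomy string contains the search term."""
    term = include.lower()
    kept_tax = {fid: tax for fid, tax in taxonomy.items() if term in tax.lower()}
    # Keep only the sequences whose IDs survived the taxonomy filter
    kept_seqs = {fid: seq for fid, seq in sequences.items() if fid in kept_tax}
    return kept_tax, kept_seqs


# Toy data for illustration only
taxonomy = {
    "s1": "d__Eukaryota; k__Animalia",
    "s2": "d__Bacteria; p__Proteobacteria",
    "s3": "d__Eukaryota; p__Ciliophora",
}
sequences = {"s1": "ACGT", "s2": "GGCC", "s3": "TTAA"}

euk_tax, euk_seqs = filter_by_taxon(taxonomy, sequences, "Eukaryota")
print(sorted(euk_tax))
```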

### Dereplication of sequences and taxonomy
Given the notes outlined for the SILVA 138 and 138.1 NR99 releases, there may be identical full-length sequences with either identical or different taxonomies. We’ll dereplicate this data before moving forward, which removes redundant sequence data from the database prior to downstream processing. RESCRIPt provides several options for sequence-taxonomy dereplication; see the `--help` text of `qiime rescript dereplicate` for more information.

#### Dereplicating in uniq mode
Here we will use the default uniq approach. That is, we’ll retain identical sequence records that have differing taxonomies. We’ll specify the option here for the sake of clarity, but feel free to use any of the --p-mode options that make sense to you.

In [None]:
### CHECKPOINT ###
# Take note of the file names: the output names were shortened
# (QIIME 2 appends the .qza extension automatically when it is omitted)
!qiime rescript dereplicate \
    --i-sequences silva-138.1-seqs-euk.qza \
    --i-taxa silva-138.1-tax-euk.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-seqs-euk-drep \
    --o-dereplicated-taxa silva-138.1-tax-euk-drep
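
As described above, uniq mode collapses records that share both sequence and taxonomy, while identical sequences with differing taxonomies are all retained. A minimal pure-Python sketch of that behavior is shown here; it is not RESCRIPt's implementation, and `dereplicate_uniq` and the toy records are hypothetical.

```python
def dereplicate_uniq(sequences: dict, taxonomy: dict):
    """Keep the first record seen for each (sequence, taxonomy) pair."""
    seen = set()
    kept_seqs, kept_tax = {}, {}
    for fid, seq in sequences.items():
        key = (seq, taxonomy[fid])
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        kept_seqs[fid] = seq
        kept_tax[fid] = taxonomy[fid]
    return kept_seqs, kept_tax


# Toy data for illustration only: a and b are exact duplicates,
# c has the same sequence but a different taxonomy
sequences = {"a": "ACGT", "b": "ACGT", "c": "ACGT", "d": "GGCC"}
taxonomy = {"a": "g__X", "b": "g__X", "c": "g__Y", "d": "g__X"}

drep_seqs, drep_tax = dereplicate_uniq(sequences, taxonomy)
print(sorted(drep_seqs))
```

Record `c` survives because its taxonomy differs from `a`, which is the trade-off of uniq mode: no taxonomic information is merged away, at the cost of keeping some identical sequences.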

A classifier can be made using the outputs from the above cell. However, this pipeline continues on to build an amplicon-region-specific classifier.

### Get specific amplicon region sequences
Here, we’ll go into more detail about how to generate an amplicon-specific classifier. Constructing such a classifier allows for more robust taxonomic classification of your data (Werner et al. 2011; Bokulich et al. 2018).

Here, you’ll use the same primer sequences that you used for your PCR / sequencing to extract the amplicon region from our reference database. Note that we enter the primer sequences in the 5’-3’ direction, i.e. as you would order the oligos from a vendor. We’ll extract the V4 region using the primers E572F/E1009R by Comeau et al. (2011). We’ll also set --p-read-orientation 'forward', as the SILVA database is curated to be in the same “forward” orientation. This allows us to process the data more quickly, without having to account for mixed-orientation sequences during our primer search. We’ll continue by making use of the files from previous steps.

In [None]:
# Keep only extracted amplicons that are 200-700 bp long
!qiime feature-classifier extract-reads \
    --i-sequences silva-138.1-seqs-euk-drep.qza \
    --p-f-primer CYGCGGTAATTCCAGCTC \
    --p-r-primer AYGGTATCTRATCRTCTTYG \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --p-min-length 200 \
    --p-max-length 700 \
    --o-reads silva-138.1-seqs-euk-v4-200700.qza
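
To make the primer-based extraction above more concrete, here is a simplified in-silico PCR sketch. It expands the degenerate IUPAC primers into regular expressions, looks for the forward primer and the reverse complement of the reverse primer on the forward strand, and returns the region between them if its length is within bounds. It requires exact matches, whereas the real `extract-reads` tolerates mismatches, so treat it as an illustration only. All function names and the toy reference are hypothetical.

```python
import re

# IUPAC code -> regex character class
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def primer_to_regex(primer: str) -> str:
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq, f_primer, r_primer, min_length=200, max_length=700):
    """Return the region between the primers, or None if not found or out of range."""
    fwd = re.search(primer_to_regex(f_primer), seq)
    if fwd is None:
        return None
    downstream = seq[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(r_primer)), downstream)
    if rev is None:
        return None
    amplicon = downstream[:rev.start()]
    return amplicon if min_length <= len(amplicon) <= max_length else None


F_PRIMER = "CYGCGGTAATTCCAGCTC"    # E572F, as in the cell above
R_PRIMER = "AYGGTATCTRATCRTCTTYG"  # E1009R, as in the cell above

# Toy reference: flank + forward site + 240 bp insert + reverse site + flank
insert = "ACGT" * 60
toy_ref = "GG" + "CTGCGGTAATTCCAGCTC" + insert + "CAAAGACGATCAGATACCGT" + "TT"
amplicon = extract_amplicon(toy_ref, F_PRIMER, R_PRIMER)
print(len(amplicon))
```

The `min_length` and `max_length` defaults mirror the 200 and 700 values passed to the command above; a reference whose primer sites are closer together or further apart is dropped rather than truncated.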

### Look at database stats

In [None]:
!qiime feature-table tabulate-seqs \
    --i-data silva-138.1-seqs-euk-v4-200700.qza \
    --o-visualization silva-138.1-seqs-euk-v4-200700.qzv

# Visualize
q2.Visualization.load('silva-138.1-seqs-euk-v4-200700.qzv')

### Dereplicate extracted region
Even though we already dereplicated our full-length sequences, we’ll do so again as the extracted amplicon regions may now be identical over this shorter region.

We may have unique sequences that point to quite different taxonomies (e.g. different genera). Thus, we need to decide how we’d like to handle the taxonomy by choosing one of the available --p-mode options.
Conversely, we may have had many different full-length sequences with identical taxonomy, and now these records, after extracting the amplicon region, are identical. As a result of this, we can reduce our database size by simply dereplicating them, as we’ve done earlier.

In [None]:
!qiime rescript dereplicate \
    --i-sequences silva-138.1-seqs-euk-v4-200700.qza \
    --i-taxa silva-138.1-tax-euk-drep.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-seqs-euk-v4-200700-drep.qza \
    --o-dereplicated-taxa silva-138.1-tax-euk-v4-200700-drep.qza

### Build amplicon-region specific classifier

In [None]:
!qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138.1-seqs-euk-v4-200700-drep.qza \
    --i-reference-taxonomy silva-138.1-tax-euk-v4-200700-drep.qza \
    --p-classify--chunk-size 10000 \
    --o-classifier silva-138.1-euk-v4-200700-classifier.qza