Here we aim to assign taxonomic identity to the set of de novo OTUs obtained from the DNA and eDNA samples taken from river catchments in the UK.

Here we use the full GenBank as the reference database (this will be replaced by a curated invertebrate database in future runs). Taxonomic assignment will be performed using two different approaches:

 - BLAST based LCA
 - Kraken (k-mer based sequence classification)
 
We will again be using metaBEAT to facilitate reproducibility.
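
As a rough illustration of the idea behind a BLAST based LCA (lowest common ancestor) assignment, the sketch below takes the taxonomic lineages of the hits retained for one OTU and returns the deepest level they all share. This is a minimal sketch, not metaBEAT's implementation: the function name `lowest_common_ancestor` and the example hits are assumptions made for illustration, and metaBEAT's own filtering of hits before the LCA step is not reproduced here.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-tip lineages.

    Each lineage is a list of taxon names ordered from the root
    towards the species. An empty result means nothing is shared.
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks all lineages rank by rank, in parallel.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree from this rank downwards
        shared.append(ranks[0])
    return shared


# Hypothetical lineages for three hits of a single OTU (illustrative only).
hits = [
    ["Eukaryota", "Arthropoda", "Insecta", "Ephemeroptera",
     "Baetidae", "Baetis", "Baetis rhodani"],
    ["Eukaryota", "Arthropoda", "Insecta", "Ephemeroptera",
     "Baetidae", "Baetis", "Baetis vernus"],
    ["Eukaryota", "Arthropoda", "Insecta", "Ephemeroptera",
     "Baetidae", "Cloeon", "Cloeon dipterum"],
]
assignment = lowest_common_ancestor(hits)
print(assignment[-1] if assignment else "unassigned")
```

With these example hits the OTU would be assigned at family level, because the hits disagree at genus level.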

The final result of this notebook will be a taxonomically annotated OTU table in BIOM format from each approach, which I can then compare. The BIOM format and its associated set of Python functions have been developed as a standard for representing 'biological sample by observation contingency tables' in the -omics area.
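
To make the structure of a BIOM table more concrete, the sketch below reads a tiny table in the BIOM 1.0 JSON layout using only the standard library and turns it into nested dictionaries of counts. It is a minimal sketch for illustration: the function name `biom_json_to_counts`, the OTU and sample identifiers, and the counts are all made up, and in practice the `biom-format` package or the metaBEAT helper functions would be used instead.

```python
import json


def biom_json_to_counts(biom_text):
    """Convert BIOM 1.0 JSON text into {otu_id: {sample_id: count}}."""
    table = json.loads(biom_text)
    otus = [row["id"] for row in table["rows"]]         # observations
    samples = [col["id"] for col in table["columns"]]   # samples
    counts = {otu: {s: 0 for s in samples} for otu in otus}
    if table["matrix_type"] == "sparse":
        # Sparse tables store [row_index, column_index, value] triples.
        for r, c, value in table["data"]:
            counts[otus[r]][samples[c]] = value
    else:
        # Dense tables store one list of values per observation.
        for r, row in enumerate(table["data"]):
            for c, value in enumerate(row):
                counts[otus[r]][samples[c]] = value
    return counts


# Hypothetical two-OTU, two-sample table (identifiers are placeholders).
example = json.dumps({
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "rows": [{"id": "OTU_1", "metadata": None},
             {"id": "OTU_2", "metadata": None}],
    "columns": [{"id": "sampleA", "metadata": None},
                {"id": "sampleB", "metadata": None}],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 2],
    "data": [[0, 0, 12], [1, 1, 3]],
})
counts = biom_json_to_counts(example)
print(counts["OTU_1"]["sampleA"])
```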

Most of the input data was produced during processing of the eDNA samples.

I must specify the location and file format of the reference sequences. Different formats (fasta, GenBank) can be mixed and matched. To do this, prepare a simple text file that lists the path to each file together with its format specification.


**Notes for using the invert_database when it is finished**

The reference sequences in GenBank/fasta format are contained in the directory Reference_Alignment. The GenBank file is called `12S_UK...._SATIVA_cleaned.gb`, and there are additional fasta files containing Sanger sequences to supplement the records on GenBank.

Produce the text file listing the invert_database reference sequences using the command line. We call it REFmap.txt.

In [None]:
pwd

In [None]:
mkdir taxonomic_assignment

In [None]:
cd taxonomic_assignment/

In [None]:
ls ../../reference_database/CO1_refdb/

In [None]:
!echo '../../reference_database/CO1_refdb/CO1_Acaria_Aracnida_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_amphipoda-part_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Anellida_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_coccinellidae_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Coleoptera_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Crustacea_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_EPNM_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Hemiptera-Hymenoptera_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Mollusca_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Odonata_SATIVA_cleaned.gb\tgb\n' \
'../../reference_database/CO1_refdb/CO1_Trichoptera_Lepidoptera_SATIVA_cleaned.gb\tgb' > REFmap.txt

In [None]:
!cat REFmap.txt
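
As an alternative to typing every path by hand, the reference map could also be generated from the contents of the reference directory. The sketch below is a minimal illustration under stated assumptions: the helper `build_refmap` is hypothetical, the extension-to-format mapping (`.gb` to `gb`, `.fasta` to `fasta`) is assumed from the formats mentioned above, and the demo runs on a throwaway directory with placeholder file names rather than on the real `CO1_refdb` directory.

```python
import os
import tempfile


def build_refmap(ref_dir, out_path, formats=None):
    """Write one 'path<TAB>format' line per reference file in ref_dir."""
    # Assumed mapping from file extension to the format token.
    formats = formats or {".gb": "gb", ".fasta": "fasta"}
    lines = []
    for name in sorted(os.listdir(ref_dir)):
        ext = os.path.splitext(name)[1]
        if ext in formats:  # skip anything that is not a reference file
            lines.append("%s\t%s" % (os.path.join(ref_dir, name), formats[ext]))
    with open(out_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


# Demo on a temporary directory with placeholder file names.
demo_dir = tempfile.mkdtemp()
for name in ("CO1_demoA.gb", "CO1_demoB.fasta", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()
entries = build_refmap(demo_dir, os.path.join(demo_dir, "REFmap.txt"))
print("\n".join(entries))
```

Writing the lines from Python also avoids the single space that `echo` inserts between its arguments, which would otherwise end up at the start of every line after the first.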

**As we have no invert_database yet, we will be blasting against the full online NCBI database**

Produce the text file listing the non-chimera query sequences, Querymap.txt.

In [None]:
%%bash

#Querymap
for a in $(ls -l ../chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\t../chimera_detection/$a/$a-nonchimeras.fasta"
done > Querymap.txt

In [None]:
!cat Querymap.txt

The Querymap.txt file has been made, but it includes the GLOBAL directory in which all centroids and queries are contained (line 514). This entry will cause metaBEAT to fail, so it must be removed from the Querymap.txt file.

In [None]:
!sed '/GLOBAL/d' Querymap.txt > Querymap_final.txt

In [None]:
!cat Querymap_final.txt
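
The two steps above (listing the sample directories, then dropping the GLOBAL entry) can also be sketched as a single Python function. This is a minimal sketch, assuming the same layout as in the bash loop (one subdirectory per sample holding `<sample>-nonchimeras.fasta`); the helper name `build_querymap` and the demo sample names are hypothetical. Note that it skips a directory named exactly `GLOBAL`, whereas the `sed` command removes any line containing that string.

```python
import os
import tempfile


def build_querymap(chimera_dir, out_path, skip=("GLOBAL",)):
    """Write 'sample-nc<TAB>fasta<TAB>path' lines, one per sample directory."""
    lines = []
    for sample in sorted(os.listdir(chimera_dir)):
        sample_dir = os.path.join(chimera_dir, sample)
        # Only per-sample directories are wanted; GLOBAL is left out.
        if not os.path.isdir(sample_dir) or sample in skip:
            continue
        fasta = os.path.join(sample_dir, sample + "-nonchimeras.fasta")
        lines.append("%s-nc\tfasta\t%s" % (sample, fasta))
    with open(out_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


# Demo on a temporary directory with placeholder sample names.
demo_root = tempfile.mkdtemp()
for sample in ("sampleA", "sampleB", "GLOBAL"):
    os.mkdir(os.path.join(demo_root, sample))
query_lines = build_querymap(demo_root, os.path.join(demo_root, "Querymap_final.txt"))
print("\n".join(query_lines))
```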

That's almost it. Now start the pipeline to do sequence clustering and taxonomic assignment of the non-chimera queries via metaBEAT. As input, the query map (Querymap_final.txt, listing the samples that have been trimmed, merged and checked for chimeras) and the REFmap.txt file must be specified. metaBEAT will be asked to attempt taxonomic assignment with the two different approaches mentioned above.

Kraken requires a specific database that metaBEAT will build automatically if necessary.
metaBEAT will automatically wrangle the data into the particular file formats that are required by each of the methods, run all necessary steps, and finally convert the outputs of each program to a standardized BIOM table.
GO!

In [1]:
!metaBEAT_global.py -h

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>] [--bc_dist <INT>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gb_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--update_taxonomy] [--taxonomy_db <FILE>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--bitscore_skim_LCA <FLOAT>] [--bitsc

In [None]:
%%bash

metaBEAT_global.py \
-Q Querymap_final.txt \
-R REFmap.txt \
--cluster --clust_match 0.97 --clust_cov 2 \
--blast --min_ident 0.97 --min_ali_length 0.8 \
-m COI -n 5 \
-E -v \
-@ M.Benucci@2015.hull.ac.uk \
-o CO1DvAug18-merge-forwonly-nonchimera-c97-cov2_refdb-id97 &> log0.97

In [None]:
!tail -n 50 log0.97

Preparing and blasting unassigned sequences

In [None]:
import metaBEAT_global_misc_functions as mb

In [None]:
pwd

In [None]:
cd ../taxonomic_assignment/GLOBAL/BLAST_0.97/

In [None]:
cd ../../

In [None]:
!ls GLOBAL/BLAST_0.97/

In [None]:
table = mb.load_BIOM('./GLOBAL/BLAST_0.97/CO1DvAug18-merge-forwonly-nonchimera-c97-cov2_refdb-id97-OTU-taxonomy.blast.biom', informat='json')

In [None]:
print table

In [None]:
unassigned_table = mb.BIOM_return_by_tax_level(taxlevel='unassigned', BIOM=table, invert=False)

In [None]:
print unassigned_table.metadata(axis='observation')

In [None]:
!ls ./GLOBAL/

In [None]:
mb.extract_fasta_by_BIOM_OTU_ids(in_fasta='./GLOBAL/global_queries.fasta', BIOM=unassigned_table,
                                out_fasta='./GLOBAL/unassigned_only.fasta')
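
Conceptually, this step keeps only those FASTA records whose identifier is among the OTU ids of the unassigned table. The sketch below shows that idea in plain Python; it is not the code behind `mb.extract_fasta_by_BIOM_OTU_ids`, the function name `filter_fasta` is hypothetical, and it assumes that the record identifier is the first whitespace-delimited token of the header line.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Return the lines of the FASTA records whose id is in keep_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the id is the first token after '>'.
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            kept.append(line)
    return kept


# Placeholder records and ids, for illustration only.
records = [
    ">OTU_1 some description",
    "ACGTACGT",
    "ACGT",
    ">OTU_2",
    "TTGACCAA",
    ">OTU_3",
    "GGCCTTAA",
]
subset = filter_fasta(records, keep_ids={"OTU_1", "OTU_3"})
print("\n".join(subset))
```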

In [None]:
!ls ./GLOBAL/

In [None]:
unassigned_table_notax = mb.drop_BIOM_taxonomy(unassigned_table)

In [None]:
print unassigned_table_notax.metadata(axis='observation')

In [None]:
mb.write_BIOM(BIOM=unassigned_table_notax, target_prefix='./GLOBAL/unassigned_only_denovo', outfmt=['json','tsv'])

In [None]:
!ls ./GLOBAL

In [None]:
cd ..

In [None]:
!mkdir unassigned_otu

In [None]:
!cp taxonomic_assignment/GLOBAL/u* ./unassigned_otu/

In [None]:
cd unassigned_otu/

In [None]:
%%bash

metaBEAT_global.py \
-B unassigned_only_denovo.biom \
--g_queries unassigned_only.fasta \
--cluster --clust_match 0.97 --clust_cov 2 \
--blast --blast_db ../../BLAST_DB-aug18/nt/nt --min_ident 0.97 --min_ali_length 0.8 \
-m COI -n 5 \
-E -v \
-@ M.Benucci@2015.hull.ac.uk \
-o CO1DvAug18-merge-forwonly_nonchimera_blast-unassigned_c97-cov2_blast-id97 &> log0.97

Troubleshooting notes:
An error comes up when one or more taxids are present in the taxid files (gi_to_taxid.csv, gb_to_taxid.csv, taxid.txt) but, for some reason, are not present in the taxonomy database that the current metaBEAT image contains. The taxonomy database in the current image can be brought up to date manually with `metaBEAT_global.py --update_taxonomy`.

This was not needed here, as we have already updated the DB.

In [None]:
!tail -n 50 log0.97

If the analysis fails with an error reporting that fewer taxon IDs were returned than were in the list provided, the taxonomy database needs to be updated. The command below updates the database so that the `taxtastic` package can run. It will take a few minutes.

In [None]:
%%bash

metaBEAT_global.py --update_taxonomy