### Illumina read processing and taxonomic classification of query sequences ###

We are using our custom pipeline [metaBEAT](https://github.com/HullUni-bioinformatics/metaBEAT) to process the Illumina data and taxonomically identify query sequences. 

For full reproducibility, metaBEAT was run inside a Docker container; the image is available [here](https://hub.docker.com/r/chrishah/metabeat/).

After initial read quality trimming, merging and clustering, query sequences are screened for potential chimeras against a reference database composed of COI sequences of _Crangonyx pseudogracilis_, _Crangonyx floridanus_, _Crangonyx islandicus_, and the positive control taxon _Osmia bicornis_ (all downloaded from GenBank as described [here](https://localhost:8888/notebooks/C_floridanus/1-NCBI_references/C_floridanus_NCBI_reference.ipynb)). Taxonomic assignment is then performed using BLAST and a lowest common ancestor (LCA) approach, as described in the paper.
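
To illustrate the LCA idea in general terms: when a query has several BLAST hits of comparable quality, it is assigned to the deepest taxon shared by all of those hits. The following is a minimal sketch, assuming each hit has already been resolved to a root-to-tip lineage. The function name `lowest_common_ancestor` and the abbreviated example lineages are illustrative only; metaBEAT's own implementation may differ.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each ordered root -> tip)."""
    if not lineages:
        return None
    lca = None
    # zip() stops at the shortest lineage, so we never run past its end
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the lineages diverge at this depth
        lca = ranks[0]
    return lca


# Abbreviated, illustrative lineages (hypothetical input, not metaBEAT output)
hits_same_species = [
    ["Arthropoda", "Amphipoda", "Crangonyx", "Crangonyx floridanus"],
    ["Arthropoda", "Amphipoda", "Crangonyx", "Crangonyx floridanus"],
]
hits_two_species = [
    ["Arthropoda", "Amphipoda", "Crangonyx", "Crangonyx floridanus"],
    ["Arthropoda", "Amphipoda", "Crangonyx", "Crangonyx pseudogracilis"],
]

print(lowest_common_ancestor(hits_same_species))  # Crangonyx floridanus
print(lowest_common_ancestor(hits_two_species))   # Crangonyx
```

When all hits agree, the query is assigned at species level; when they disagree, the assignment falls back to the shared genus.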

The file `Querymap.txt` lists, for each sample, the sample ID, the location of the Illumina read files, the barcodes, and the number of bases to clip from the start of the forward and reverse reads in order to remove primers and heterogeneity spacers.
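
As a reading aid, one line of `Querymap.txt` can be split into named fields. This is a minimal sketch: the column names, and in particular which barcode and crop value belong to the forward and which to the reverse read, are our assumptions based on the description above, and `parse_querymap_line` is a hypothetical helper rather than part of metaBEAT.

```python
from collections import namedtuple

# Column names are assumptions inferred from the description of Querymap.txt
QueryRow = namedtuple(
    "QueryRow",
    "sample fmt forward_reads reverse_reads barcode_fwd barcode_rev crop_fwd crop_rev",
)


def parse_querymap_line(line):
    """Split one tab-separated Querymap line into named, typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 tab-separated fields, got %d" % len(fields))
    # The last two columns are the numbers of bases to crop from each read
    return QueryRow(*(fields[:6] + [int(fields[6]), int(fields[7])]))


# Example row for sample CH101-1-8-May, as it appears in the file
example = "\t".join([
    "CH101-1-8-May", "fastq",
    "../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz",
    "../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz",
    "TCGCCTTA", "CTAAGCCT", "26", "33",
])
row = parse_querymap_line(example)
print(row.sample, row.barcode_fwd, row.barcode_rev, row.crop_fwd, row.crop_rev)
```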

The file `REFlist.txt` points towards the sequences included in the reference database.

In [1]:
!head Querymap.txt

neg-7-5-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz	GTAGAGAG	GTAAGGAG	32	30
CH101-1-8-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz	TCGCCTTA	CTAAGCCT	26	33
CH102-1-7-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz	TCGCCTTA	AAGGAGTA	26	32
CH103-1-6-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz	TCGCCTTA	ACTGCATA	26	31
CH104-1-5-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fastq.gz	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R2_001.fastq.gz	TCGCCTTA	GTAAGGAG	26	30
CH105-1-4-May	fastq	../../Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_

We prepare the list for the reference database.

In [2]:
%%bash

for gb in $(ls -1 ../1-NCBI_references/*.gb | grep "refseqs" -v)
do
    echo -e "$gb\tgb"
done > REFlist.txt

In [3]:
!cat REFlist.txt

../1-NCBI_references/C.islandicus_NCBI_2018.gb	gb
../1-NCBI_references/Crangonyx_NCBI_2018.gb	gb
../1-NCBI_references/O.bicornis_NCBI_2018.gb	gb
../1-NCBI_references/Synurella_NCBI_2018.gb	gb


### Trimming, merging and clustering

In [4]:
!metaBEAT_global.py --version

0.97.11-global


In [5]:
!metaBEAT_global.py --help

usage: metaBEAT.py [-h] [-Q <FILE>] [-B <FILE>] [--g_queries <FILE>] [-v] [-s]
                   [-f] [-p] [-k] [-t] [-b] [-m <string>] [-n <INT>] [-E] [-e]
                   [--read_stats_off] [--PCR_primer <FILE>] [--bc_dist <INT>]
                   [--trim_adapter <FILE>] [--trim_qual <INT>] [--phred <INT>]
                   [--trim_window <INT>] [--read_crop <INT>]
                   [--trim_minlength <INT>] [--merge] [--product_length <INT>]
                   [--merged_only] [--forward_only] [--length_filter <INT>]
                   [--length_deviation <FLOAT>] [-R <FILE>] [--gb_out <FILE>]
                   [--rec_check] [--gb_to_taxid <FILE>] [--cluster]
                   [--clust_match <FLOAT>] [--clust_cov <INT>]
                   [--blast_db <PATH>] [--blast_xml <PATH>]
                   [--update_taxonomy] [--taxonomy_db <FILE>]
                   [--min_ident <FLOAT>] [--min_ali_length <FLOAT>]
                   [--bitscore_skim_LCA <FLOAT>] [--bitsc

We run metaBEAT to demultiplex, quality-trim and merge the reads from the metabarcoding data.

In [6]:
!metaBEAT_global.py \
-Q Querymap.txt \
--trim_qual 30 \
--length_filter 313 \
--product_length 313 \
--merge --merged_only \
-m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk > metaBEAT_trim.log

In [7]:
!head -n 100 metaBEAT_trim.log


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.11-global


Wed Jul 24 08:49:52 2019

/usr/bin/metaBEAT_global.py -Q Querymap.txt --trim_qual 30 --length_filter 313 --product_length 313 --merge --merged_only -m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'm.benucci@2015.hull.ac.uk'


Parsing querylist file

Number of samples to process: 127
Sequence input format: defaultdict(<type 'int'>, {'fastq': 127})
Barcodes for demultiplexing provided for 127 samples
Cropping instructions provided for 127 samples


Wed Jul 24 08:49:52 2019


### DEMULTIPLEXING ###

assessing basic characteristics
data comes as paired end - ok
process_shortreads -1 /home/working/Seqrun_COI_Aug17/data_files/Gp-gutcont3_S3_L001_R1_001.fas

In [8]:
!tail -n 100 metaBEAT_trim.log


### READ CLIPPING ###

Cropping the first 26 bases off reads in file: pos-9-7-May_forward.paired.fastq.gz
Cropping the first 26 bases off reads in file: pos-9-7-May_forward.singletons.fastq.gz
Cropping the first 32 bases off reads in file: pos-9-7-May_reverse.paired.fastq.gz
Cropping the first 32 bases off reads in file: pos-9-7-May_reverse.singletons.fastq.gz

### MERGING READ PAIRS ###

merging paired-end reads with flash

flash pos-9-7-May_forward.paired.headclipped.fastq.gz pos-9-7-May_reverse.paired.headclipped.fastq.gz -M 313 -t 5 -p 33 -o pos-9-7-May -z
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]  
[FLASH] Input files:
[FLASH]     pos-9-7-May_forward.paired.headclipped.fastq.gz
[FLASH]     pos-9-7-May_reverse.paired.headclipped.fastq.gz
[FLASH]  
[FLASH] Output files:
[FLASH]     ./pos-9-7-May.extendedFrags.fastq.gz
[FLASH]     ./pos-9-7-May.notCombined_1.fastq.gz
[FLASH]     ./pos-9-7-May.notCombined_2.fastq.gz
[

### Chimera detection

We then run the chimera detection step, using the `.gb` reference files prepared earlier.

In [9]:
pwd

u'/home/working/C_floridanus/2-metaBEAT'

In [10]:
!mkdir chimera_detection

mkdir: cannot create directory ‘chimera_detection’: File exists


In [11]:
cd chimera_detection

/home/working/C_floridanus/2-metaBEAT/chimera_detection


In [12]:
!ls ../../1-NCBI_references/

C_floridanus_NCBI_reference.ipynb  O.bicornis_NCBI_2018.gb
c-islandicus_accession.txt	   pos-taxa_accession.txt
C.islandicus_NCBI_2018.gb	   synurella_accession.txt
Crangonyx_accession.txt		   Synurella_NCBI_2018.gb
Crangonyx_NCBI_2018.gb


In [13]:
%%bash

#Write REFmap
for file in $(ls ../../1-NCBI_references/* | grep "NCBI_2018.gb$")
do
    echo -e "$file\tgb"
done > chimera_REFmap.txt

In [14]:
!cat chimera_REFmap.txt

../../1-NCBI_references/C.islandicus_NCBI_2018.gb	gb
../../1-NCBI_references/Crangonyx_NCBI_2018.gb	gb
../../1-NCBI_references/O.bicornis_NCBI_2018.gb	gb
../../1-NCBI_references/Synurella_NCBI_2018.gb	gb


In [15]:
%%bash

metaBEAT_global.py \
-R chimera_REFmap.txt \
-f \
-@ M.Benucci@2015.hull.ac.uk


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.11-global


Wed Jul 24 13:25:05 2019

/usr/bin/metaBEAT_global.py -R chimera_REFmap.txt -f -@ M.Benucci@2015.hull.ac.uk


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'M.Benucci@2015.hull.ac.uk'


######## PROCESSING REFERENCE DATA ########


processing ../../1-NCBI_references/Crangonyx_NCBI_2018.gb (containing 62 records)

total number of valid records: 62


processing ../../1-NCBI_references/Synurella_NCBI_2018.gb (containing 3 records)

total number of valid records: 65


processing ../../1-NCBI_references/C.islandicus_NCBI_2018.gb (containing 4 records)

total number of valid records: 69


processing ../../1-NCBI_references/O.bicornis_NCBI_2018.gb (containing 2 records)

total number of valid records

Traceback (most recent call last):
  File "/usr/bin/metaBEAT_global.py", line 2575, in <module>
    out.write(BIOM_tables_per_method['OTU_denovo'].to_tsv()) #to_json('generaged by test', direct_io=out)
  File "/usr/local/lib/python2.7/dist-packages/biom/table.py", line 4027, in to_tsv
    observation_column_name)
  File "/usr/local/lib/python2.7/dist-packages/biom/table.py", line 1268, in delimited_self
    raise TableException("Cannot delimit self if I don't have data...")
biom.exception.TableException: Cannot delimit self if I don't have data...


In [16]:
!head refs.fasta

>AB513800|326578|Crangonyx floridanus
CCAGAGCTGTTGGTACAGCCTTAAGCATAATTATCCGAATCGAACTGGCAACACCAGGAAATATTATTGAAGACGATCAAATTTACAATGTTATAGTAACAGCCCATGCCTTTGTTATAATTTTTTTTATGGTTATACCCATCATGATTGGAGGGTTTGGTAACTGACTAGTGCCTCTTATATTAGGCAGCCCTGATATAGCATTTCCTCGAATAAATAATATAAGATTTTGATTATTACCTCCCTCGTTATGTCTTCTGCTTATAAGAAGTTTAATCGAAAGAGGGGTAGGAACAGGATGAACTGTCTACCCGCCTTTAGCATCTACAGCTGCTCATAGAGGTGCTTCTGTAGACTTAGCTATTTTCTCTCTTCACCTAGCAGGTGCCTCCTCTATTTTAGGTTCAATTAACTTTATTTCCACAGTAATAAATATACGAGTAAAAAATATATTAATAGACCAAATCCCTTTATTTGTTTGAGCTATTTTCTTCACTACTATTCTTCTTCTTCTTTCTTTACCTGTTCTAGCAGGAGCTATCACAATACTTTTAACAGACCGTAATCTCAATACATCATTCTTTGACCCTTCTGGGGGGGGTGACCCTATCTTGTAC
>AB513801|326578|Crangonyx floridanus
GCCTGAGCCAGAGCTGTTGGTACAGCCTTAAGCATAATTATCCGAATCGAACTGGCAACACCAGGAAATATTATTGAAGACGATCAAATTTACAATGTTATAGTAACAGCCCATGCCTTTGTTATAATTTTTTTTATGGTTATACCCATCATGATTGGAGGGTTTGGTAACTGACTAGTGCCTCTTATATTAGGCAGCCCTGATATAGCATTTCCTCGAATAAATAATATAAGATTTTGATTATTACCTCCCTCGTTATGTCTTCTGCTTATAAGAAGTTTAATCGAAAGAGGGGTAGGAACA

In [17]:
%%bash


for a in $(cut -f 1 ../Querymap.txt)
do
    if [ -s ../$a/$a\_trimmed.fasta ]
    then
        echo -e "\n### Detecting chimeras in $a ###\n"
        mkdir $a
        cd $a
        vsearch --uchime_ref ../../$a/$a\_trimmed.fasta --db ../refs.fasta \
        --nonchimeras $a-nonchimeras.fasta --chimeras $a-chimeras.fasta &> log 
        cd ..

    else
        echo -e "$a is empty"
    fi
done


### Detecting chimeras in neg-7-5-May ###


### Detecting chimeras in CH101-1-8-May ###


### Detecting chimeras in CH102-1-7-May ###


### Detecting chimeras in CH103-1-6-May ###


### Detecting chimeras in CH104-1-5-May ###


### Detecting chimeras in CH105-1-4-May ###


### Detecting chimeras in CH106-1-3-May ###


### Detecting chimeras in CH107-1-2-May ###


### Detecting chimeras in CH108-1-1-May ###


### Detecting chimeras in CH109-2-8-May ###


### Detecting chimeras in CH110-2-7-May ###


### Detecting chimeras in CH201-2-6-May ###


### Detecting chimeras in CH202-2-5-May ###


### Detecting chimeras in CH203-2-4-May ###


### Detecting chimeras in CH204-2-3-May ###


### Detecting chimeras in CH205-2-2-May ###


### Detecting chimeras in CH206-2-1-May ###


### Detecting chimeras in CH207-3-8-May ###


### Detecting chimeras in CH208-3-7-May ###


### Detecting chimeras in CH209-3-6-May ###


### Detecting chimeras in CH210-3-5-May ###


### Detecting chimeras in CH211-3-4

mkdir: cannot create directory ‘neg-7-5-May’: File exists
mkdir: cannot create directory ‘CH101-1-8-May’: File exists
mkdir: cannot create directory ‘CH102-1-7-May’: File exists
mkdir: cannot create directory ‘CH103-1-6-May’: File exists
mkdir: cannot create directory ‘CH104-1-5-May’: File exists
mkdir: cannot create directory ‘CH105-1-4-May’: File exists
mkdir: cannot create directory ‘CH106-1-3-May’: File exists
mkdir: cannot create directory ‘CH107-1-2-May’: File exists
mkdir: cannot create directory ‘CH108-1-1-May’: File exists
mkdir: cannot create directory ‘CH109-2-8-May’: File exists
mkdir: cannot create directory ‘CH110-2-7-May’: File exists
mkdir: cannot create directory ‘CH201-2-6-May’: File exists
mkdir: cannot create directory ‘CH202-2-5-May’: File exists
mkdir: cannot create directory ‘CH203-2-4-May’: File exists
mkdir: cannot create directory ‘CH204-2-3-May’: File exists
mkdir: cannot create directory ‘CH205-2-2-May’: File exists
mkdir: cannot create directory ‘CH206-2-1-
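
The `mkdir: cannot create directory` messages most likely mean that the per-sample folders already existed when the loop was re-run. The sketch below shows the same per-sample step in Python, creating a folder only if it is missing and building the `vsearch` call with the same options as in the cell above. The helper names `uchime_ref_command` and `ensure_dir` are hypothetical, and the command is only printed here, not executed.

```python
import os


def uchime_ref_command(sample, refs="../refs.fasta"):
    """Build the vsearch call for one sample (to be run from inside <sample>/)."""
    return [
        "vsearch",
        "--uchime_ref", "../../%s/%s_trimmed.fasta" % (sample, sample),
        "--db", refs,
        "--nonchimeras", "%s-nonchimeras.fasta" % sample,
        "--chimeras", "%s-chimeras.fasta" % sample,
    ]


def ensure_dir(path):
    """Create a directory only if it is missing, so re-runs stay quiet."""
    if not os.path.isdir(path):
        os.makedirs(path)


cmd = uchime_ref_command("CH101-1-8-May")
print(" ".join(cmd))
# To execute it from the chimera_detection directory, one could use e.g.:
#   import subprocess
#   ensure_dir(sample)
#   with open(os.path.join(sample, "log"), "w") as log:
#       subprocess.call(cmd, cwd=sample, stdout=log, stderr=subprocess.STDOUT)
```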

### BLAST of non-chimeric sequences

In [18]:
cd ..

/home/working/C_floridanus/2-metaBEAT


In [19]:
%%bash

#Querymap
for a in $(ls -l chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\tchimera_detection/$a/$a-nonchimeras.fasta"
done > nonchimera_Querymap.txt

In [20]:
!cat nonchimera_Querymap.txt

CH101-1-8-May-nc	fasta	chimera_detection/CH101-1-8-May/CH101-1-8-May-nonchimeras.fasta
CH101-pl1-1-1-Oct-nc	fasta	chimera_detection/CH101-pl1-1-1-Oct/CH101-pl1-1-1-Oct-nonchimeras.fasta
CH102-1-7-May-nc	fasta	chimera_detection/CH102-1-7-May/CH102-1-7-May-nonchimeras.fasta
CH102-pl1-1-2-Oct-nc	fasta	chimera_detection/CH102-pl1-1-2-Oct/CH102-pl1-1-2-Oct-nonchimeras.fasta
CH103-1-6-May-nc	fasta	chimera_detection/CH103-1-6-May/CH103-1-6-May-nonchimeras.fasta
CH103-pl1-1-3-Oct-nc	fasta	chimera_detection/CH103-pl1-1-3-Oct/CH103-pl1-1-3-Oct-nonchimeras.fasta
CH104-1-5-May-nc	fasta	chimera_detection/CH104-1-5-May/CH104-1-5-May-nonchimeras.fasta
CH104-pl1-1-4-Oct-nc	fasta	chimera_detection/CH104-pl1-1-4-Oct/CH104-pl1-1-4-Oct-nonchimeras.fasta
CH105-1-4-May-nc	fasta	chimera_detection/CH105-1-4-May/CH105-1-4-May-nonchimeras.fasta
CH105-pl1-1-5-Oct-nc	fasta	chimera_detection/CH105-pl1-1-5-Oct/CH105-pl1-1-5-Oct-nonchimeras.fasta
CH106-1-3-May-nc	fasta	chimera_detection/CH106-1-3-May/CH106

In [21]:
!sed '/GLOBAL/d' nonchimera_Querymap.txt > nonchimera_Querymap_final.txt

In [22]:
!cat nonchimera_Querymap_final.txt

CH101-1-8-May-nc	fasta	chimera_detection/CH101-1-8-May/CH101-1-8-May-nonchimeras.fasta
CH101-pl1-1-1-Oct-nc	fasta	chimera_detection/CH101-pl1-1-1-Oct/CH101-pl1-1-1-Oct-nonchimeras.fasta
CH102-1-7-May-nc	fasta	chimera_detection/CH102-1-7-May/CH102-1-7-May-nonchimeras.fasta
CH102-pl1-1-2-Oct-nc	fasta	chimera_detection/CH102-pl1-1-2-Oct/CH102-pl1-1-2-Oct-nonchimeras.fasta
CH103-1-6-May-nc	fasta	chimera_detection/CH103-1-6-May/CH103-1-6-May-nonchimeras.fasta
CH103-pl1-1-3-Oct-nc	fasta	chimera_detection/CH103-pl1-1-3-Oct/CH103-pl1-1-3-Oct-nonchimeras.fasta
CH104-1-5-May-nc	fasta	chimera_detection/CH104-1-5-May/CH104-1-5-May-nonchimeras.fasta
CH104-pl1-1-4-Oct-nc	fasta	chimera_detection/CH104-pl1-1-4-Oct/CH104-pl1-1-4-Oct-nonchimeras.fasta
CH105-1-4-May-nc	fasta	chimera_detection/CH105-1-4-May/CH105-1-4-May-nonchimeras.fasta
CH105-pl1-1-5-Oct-nc	fasta	chimera_detection/CH105-pl1-1-5-Oct/CH105-pl1-1-5-Oct-nonchimeras.fasta
CH106-1-3-May-nc	fasta	chimera_detection/CH106-1-3-May/CH106
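
The two steps above (listing the sample folders, then dropping the `GLOBAL` entry with `sed`) can also be sketched in a single Python function. This is an illustrative alternative under our own assumptions, not what was run here: `querymap_lines` and `list_subdirs` are hypothetical helpers, and Python's sort order may differ from that of `ls`.

```python
import os


def querymap_lines(sample_dirs, base="chimera_detection", skip=("GLOBAL",)):
    """Return one tab-separated Querymap line per sample directory name."""
    lines = []
    for name in sample_dirs:
        if name in skip:
            continue  # same effect as the sed '/GLOBAL/d' step
        fasta = "%s/%s/%s-nonchimeras.fasta" % (base, name, name)
        lines.append("%s-nc\tfasta\t%s" % (name, fasta))
    return lines


def list_subdirs(base):
    """Names of the sub-directories of `base`, sorted."""
    return sorted(d for d in os.listdir(base)
                  if os.path.isdir(os.path.join(base, d)))


# Only list the real folders if this is run next to chimera_detection/
if os.path.isdir("chimera_detection"):
    for line in querymap_lines(list_subdirs("chimera_detection")):
        print(line)
```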

We now run the BLAST search of the non-chimeric sequences against the reference sequences we downloaded, which are listed in `REFlist.txt`.
Sequences are clustered at 97% similarity, retaining only clusters that contain at least 6 sequences. For taxonomic assignment we require a minimum identity of 97% and a minimum alignment length of 85%.
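
To make the two assignment thresholds concrete, the sketch below checks a single hit against a minimum identity of 0.97 and a minimum alignment length of 0.85. It assumes that the alignment length is measured as a fraction of the query length; that interpretation, the function name `passes_thresholds` and the example hits are ours, and metaBEAT's internal logic may differ.

```python
def passes_thresholds(identities, align_len, query_len,
                      min_ident=0.97, min_ali_length=0.85):
    """True if a hit meets both the identity and the alignment-length cut-off."""
    if align_len <= 0 or query_len <= 0:
        return False
    ident = identities / float(align_len)    # fraction of identical positions
    coverage = align_len / float(query_len)  # aligned fraction of the query (assumed)
    return ident >= min_ident and coverage >= min_ali_length


# Hypothetical hits for a 313 bp query: (identical positions, alignment length)
print(passes_thresholds(310, 313, 313))  # True: 99.0% identity over the full length
print(passes_thresholds(300, 313, 313))  # False: 95.8% identity
print(passes_thresholds(250, 250, 313))  # False: only 79.9% of the query is aligned
```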

In [41]:
!metaBEAT_global.py \
-Q nonchimera_Querymap_final.txt \
-R REFlist.txt \
--cluster --clust_match 0.97 --clust_cov 6 \
--blast --min_ident 0.97 --min_ali_length 0.85 \
-m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk \
-o COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref > metaBEAT.log

In [42]:
!head -n 100 metaBEAT.log


metaBEAT - metaBarcoding and Environmental DNA Analyses tool
version: v.0.97.11-global


Wed Jul 24 17:45:42 2019

/usr/bin/metaBEAT_global.py -Q nonchimera_Querymap_final.txt -R REFlist.txt --cluster --clust_match 0.97 --clust_cov 6 --blast --min_ident 0.97 --min_ali_length 0.85 -m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk -o COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref


metaBEAT may be querying NCBI's Entrez databases to fetch/verify taxonomic ids. Entrez User requirements state that you need to identify yourself by providing an email address so that NCBI can contact you in case there is a problem.

You have specified: 'm.benucci@2015.hull.ac.uk'

taxonomy.db found at /usr/bin/taxonomy.db
Taxonomy database is 0 days old - good!

Parsing querylist file

Number of samples to process: 127
Sequence input format: defaultdict(<type 'int'>, {'fasta': 127})
Barcodes for demultiplexing provided for 0 samples
Cropping instructions provided for 0 s

In [43]:
!tail -n 50 metaBEAT.log

assigned LCA Crangonyx floridanus (taxid 326578) at level species

attempting LCA assignment for CH103-pl1-1-3-Oct-nc|1_1109_8180_18246_1_ex
found LCA 326578 at level species
assigned LCA Crangonyx floridanus (taxid 326578) at level species

attempting LCA assignment for CH119-pl1-3-3-Oct-nc|1_1103_9815_11448_1_ex
found LCA 326578 at level species
assigned LCA Crangonyx floridanus (taxid 326578) at level species

attempting LCA assignment for CH505-pl1-5-3-Oct-nc|1_2111_20094_14700_1_ex
found LCA 326578 at level species
assigned LCA Crangonyx floridanus (taxid 326578) at level species

attempting LCA assignment for CH105-1-4-May-nc|1_2105_9715_15085_1_ex
found LCA 326568 at level species
assigned LCA Crangonyx pseudogracilis (taxid 326568) at level species

attempting LCA assignment for CH103-1-6-May-nc|1_2106_20229_21512_1_ex
found LCA 326568 at level species
assigned LCA Crangonyx pseudogracilis (taxid 326568) at level species

attempting LCA assignment for pcr-

The final OTU table can be found in the file: 

`GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-by-taxonomy-readcounts.blast.tsv`.

### Extracting the sequences we are interested in

In [45]:
import metaBEAT_global_misc_functions as mb

Identify samples which contained reads assigned to _C. floridanus_ before filtering.

In [46]:
mb.find_target(BIOM=mb.load_BIOM('GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-by-taxonomy-readcounts.blast.biom'), 
               target='Crangonyx_floridanus')


Specified BIOM input format 'json' - ok!
CH-BL1-pl1-7-3-Oct-nc.blast	(90.0383 %)
CH-BL2-pl1-7-4-Oct-nc.blast	(88.2554 %)
CH101-1-8-May-nc.blast	(99.5666 %)
CH101-pl1-1-1-Oct-nc.blast	(99.7500 %)
CH102-1-7-May-nc.blast	(99.4167 %)
CH102-pl1-1-2-Oct-nc.blast	(99.6956 %)
CH103-1-6-May-nc.blast	(0.0644 %)
CH103-pl1-1-3-Oct-nc.blast	(98.9692 %)
CH104-1-5-May-nc.blast	(99.6886 %)
CH104-pl1-1-4-Oct-nc.blast	(95.2475 %)
CH105-1-4-May-nc.blast	(0.1071 %)
CH105-pl1-1-5-Oct-nc.blast	(99.6454 %)
CH106-1-3-May-nc.blast	(0.0627 %)
CH106-pl1-1-6-Oct-nc.blast	(98.1265 %)
CH107-1-2-May-nc.blast	(98.9977 %)
CH107-pl1-1-7-Oct-nc.blast	(99.8794 %)
CH108-1-1-May-nc.blast	(99.8094 %)
CH108-pl1-1-8-Oct-nc.blast	(99.6482 %)
CH109-2-8-May-nc.blast	(99.6020 %)
CH109-pl1-2-1-Oct-nc.blast	(94.7490 %)
CH110-2-7-May-nc.blast	(99.8381 %)
CH110-pl1-2-2-Oct-nc.blast	(94.2847 %)
CH111-pl1-2-3-Oct-nc.blast	(21.1116 %)
CH112-pl1-2-4-Oct-nc.blast	(97.5969 %)
CH113-pl1-2-5-Oct-nc.blast	(89.8442 %)
CH114-pl1-2-6-Oct-nc.bla

The OTU-by-taxonomy table `GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-OTU-taxonomy.blast.tsv` shows that OTUs assigned to _Crangonyx floridanus_ (_Cf_) and _Crangonyx pseudogracilis_ (_Cp_) contain 87.9% of all reads. Two OTUs alone account for 87.5% of all reads: one _Cf_ OTU with 51.6% and one _Cp_ OTU with 35.9%.

We filter the raw OTU table: within each sample, we remove OTUs that are supported by fewer than 2% of that sample's reads.
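
The idea behind this per-sample filter can be sketched on a plain dictionary. This is a simplified illustration with made-up counts, assuming the threshold is applied to each OTU's share of its sample's total reads; `filter_by_sample_proportion` is a hypothetical stand-in and not the implementation of `mb.filter_BIOM_by_per_sample_read_prop`.

```python
def filter_by_sample_proportion(table, min_prop=0.02):
    """table: {sample: {otu: read_count}}.

    Within each sample, keep only OTUs holding at least min_prop
    of that sample's reads.
    """
    filtered = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        filtered[sample] = {
            otu: n for otu, n in counts.items()
            if total > 0 and n / float(total) >= min_prop
        }
    return filtered


# Hypothetical counts, not taken from the real OTU table
raw = {
    "sample_A": {"OTU_1": 980, "OTU_2": 15, "OTU_3": 5},
    "sample_B": {"OTU_1": 40, "OTU_2": 60},
}
print(filter_by_sample_proportion(raw))
```

In `sample_A`, `OTU_2` (1.5%) and `OTU_3` (0.5%) fall below the threshold and are dropped, while both OTUs in `sample_B` are kept. This is consistent with samples such as CH103-1-6-May, where _C. floridanus_ made up 0.0644 % of the reads before filtering and no longer appears afterwards.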

In [50]:
#load raw OTU table
to_filter = mb.load_BIOM(table='GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-OTU-taxonomy.blast.biom')

#filter at 2%
filtered = mb.filter_BIOM_by_per_sample_read_prop(BIOM=to_filter, min_prop=0.02)

#write to file
mb.write_BIOM(filtered, target_prefix='filtered' )

#collapse OTUs by taxonomy
filtered_collapsed = mb.collapse_biom_by_taxonomy(in_table=filtered)

#write to file
mb.write_BIOM(filtered_collapsed, target_prefix='filtered-collapsed')


Specified BIOM input format 'json' - ok!

Filtering at level: 2.0 %

Removing 610 OTUs for lack of support

Writing 'filtered.biom'
Writing 'filtered.tsv'
Writing 'filtered-collapsed.biom'
Writing 'filtered-collapsed.tsv'
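
Collapsing by taxonomy means summing, per sample, the reads of all OTUs that share a taxonomic assignment. The sketch below illustrates this on made-up counts, together with a per-sample percentage for one target taxon. Both helpers are hypothetical; we assume (but have not verified in the metaBEAT code) that the percentages printed by `mb.find_target` are per-sample read proportions, which is consistent with, for example, 95.6085 % + 4.3915 % for CH109-pl1-2-1-Oct in the outputs that follow.

```python
from collections import defaultdict


def collapse_by_taxonomy(table, taxonomy):
    """table: {sample: {otu: reads}}; taxonomy: {otu: taxon name}.

    Sum the reads of all OTUs that share a taxon, per sample.
    """
    collapsed = {}
    for sample, counts in table.items():
        per_taxon = defaultdict(int)
        for otu, n in counts.items():
            per_taxon[taxonomy.get(otu, "unassigned")] += n
        collapsed[sample] = dict(per_taxon)
    return collapsed


def percent_of_target(collapsed, target):
    """Per-sample percentage of reads assigned to `target` (samples without it are skipped)."""
    out = {}
    for sample, counts in collapsed.items():
        total = sum(counts.values())
        if total and counts.get(target, 0):
            out[sample] = 100.0 * counts[target] / total
    return out


# Hypothetical counts and assignments, for illustration only
table = {"s1": {"OTU_1": 90, "OTU_2": 10, "OTU_3": 100}}
taxonomy = {
    "OTU_1": "Crangonyx_floridanus",
    "OTU_2": "Crangonyx_floridanus",
    "OTU_3": "Crangonyx_pseudogracilis",
}
collapsed = collapse_by_taxonomy(table, taxonomy)
print(collapsed)
print(percent_of_target(collapsed, "Crangonyx_floridanus"))
```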


We identify the samples that still contain sequences assigned to _C. floridanus_ after filtering.

In [51]:
mb.find_target(filtered_collapsed, target='Crangonyx_floridanus')

CH-BL1-pl1-7-3-Oct-nc.blast	(90.7552 %)
CH-BL2-pl1-7-4-Oct-nc.blast	(88.2554 %)
CH101-1-8-May-nc.blast	(100.0000 %)
CH101-pl1-1-1-Oct-nc.blast	(100.0000 %)
CH102-1-7-May-nc.blast	(100.0000 %)
CH102-pl1-1-2-Oct-nc.blast	(100.0000 %)
CH103-pl1-1-3-Oct-nc.blast	(100.0000 %)
CH104-1-5-May-nc.blast	(100.0000 %)
CH104-pl1-1-4-Oct-nc.blast	(100.0000 %)
CH105-pl1-1-5-Oct-nc.blast	(100.0000 %)
CH106-pl1-1-6-Oct-nc.blast	(100.0000 %)
CH107-1-2-May-nc.blast	(100.0000 %)
CH107-pl1-1-7-Oct-nc.blast	(100.0000 %)
CH108-1-1-May-nc.blast	(100.0000 %)
CH108-pl1-1-8-Oct-nc.blast	(100.0000 %)
CH109-2-8-May-nc.blast	(100.0000 %)
CH109-pl1-2-1-Oct-nc.blast	(95.6085 %)
CH110-2-7-May-nc.blast	(100.0000 %)
CH110-pl1-2-2-Oct-nc.blast	(94.3920 %)
CH111-pl1-2-3-Oct-nc.blast	(21.4096 %)
CH112-pl1-2-4-Oct-nc.blast	(97.7373 %)
CH113-pl1-2-5-Oct-nc.blast	(90.3129 %)
CH114-pl1-2-6-Oct-nc.blast	(80.7732 %)
CH115-pl1-2-7-Oct-nc.blast	(95.3055 %)
CH116-pl1-2-8-Oct-nc.blast	(89.5879 %)
CH117-pl1-3-1-Oct-nc.blast	(96.1352 

And we identify the samples containing sequences assigned to _C. pseudogracilis_.

In [52]:
mb.find_target(filtered_collapsed, target='Crangonyx_pseudogracilis')

CH103-1-6-May-nc.blast	(100.0000 %)
CH105-1-4-May-nc.blast	(100.0000 %)
CH106-1-3-May-nc.blast	(100.0000 %)
CH109-pl1-2-1-Oct-nc.blast	(4.3915 %)
CH110-pl1-2-2-Oct-nc.blast	(5.6080 %)
CH111-pl1-2-3-Oct-nc.blast	(78.5904 %)
CH112-pl1-2-4-Oct-nc.blast	(2.2627 %)
CH113-pl1-2-5-Oct-nc.blast	(9.6871 %)
CH114-pl1-2-6-Oct-nc.blast	(16.3531 %)
CH115-pl1-2-7-Oct-nc.blast	(4.6945 %)
CH116-pl1-2-8-Oct-nc.blast	(10.4121 %)
CH117-pl1-3-1-Oct-nc.blast	(3.8648 %)
CH118-pl1-3-2-Oct-nc.blast	(4.5428 %)
CH119-pl1-3-3-Oct-nc.blast	(12.5749 %)
CH201-2-6-May-nc.blast	(100.0000 %)
CH203-2-4-May-nc.blast	(77.1784 %)
CH204-2-3-May-nc.blast	(93.0657 %)
CH205-2-2-May-nc.blast	(100.0000 %)
CH206-2-1-May-nc.blast	(7.3248 %)
CH209-3-6-May-nc.blast	(2.2105 %)
CH301-pl1-3-4-Oct-nc.blast	(4.5411 %)
CH302-pl1-3-5-Oct-nc.blast	(70.9771 %)
CH303-pl1-3-6-Oct-nc.blast	(16.4670 %)
CH304-pl1-3-7-Oct-nc.blast	(5.2260 %)
CH305-pl1-3-8-Oct-nc.blast	(5.2702 %)
CH306-pl1-4-1-Oct-nc.blast	(9.2920 %)
CH307-pl1-4-2-Oct-nc.blast	(14