### Illumina read processing and taxonomic classification of query sequences ###

We use our custom pipeline [metaBEAT](https://github.com/HullUni-bioinformatics/metaBEAT) to process the Illumina data and to taxonomically identify the query sequences.

For full reproducibility, metaBEAT was run inside a Docker container, which is available [here](https://hub.docker.com/r/chrishah/metabeat/).

After initial read quality trimming, merging and clustering, query sequences were screened for potential chimeras against a reference database composed of COI sequences of _Crangonyx pseudogracilis_, _Crangonyx floridanus_, _Crangonyx islandicus_, and the positive control taxon _Osmia bicornis_ (all downloaded from GenBank as described [here](https://localhost:8888/notebooks/C_floridanus/1-NCBI_references/C_floridanus_NCBI_reference.ipynb)). Taxonomic assignment was performed using BLAST and a lowest common ancestor (LCA) approach, as described in the paper.
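
To illustrate the idea behind an LCA assignment, here is a minimal sketch. It assumes each BLAST hit comes with a root-to-tip lineage and takes the deepest rank on which all hits agree. `lowest_common_ancestor` is a hypothetical helper written for this example; metaBEAT's own implementation (for instance, which hits it considers) may differ.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of several root-to-tip lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so a rank missing from any hit is never included
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


# Toy lineages for illustration: the two hits agree down to genus and differ at species
hits = [
    ["Arthropoda", "Malacostraca", "Amphipoda", "Crangonyctidae",
     "Crangonyx", "Crangonyx floridanus"],
    ["Arthropoda", "Malacostraca", "Amphipoda", "Crangonyctidae",
     "Crangonyx", "Crangonyx pseudogracilis"],
]
print(lowest_common_ancestor(hits)[-1])  # Crangonyx
```

A query whose best hits are split between the two species would therefore be assigned to the genus rather than to either species.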

The file `Querymap.txt` lists the sample IDs and the locations of the Illumina read files, together with the barcodes and the number of bases to clip from the start of both the forward and reverse reads in order to remove primers and heterogeneity spacers.

The file `REFlist.txt` points towards the sequences included in the reference database.

In [None]:
!head Querymap.txt
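
As a quick sanity check, the sample IDs can also be read in Python. The sketch below assumes that `Querymap.txt` is tab-separated with the sample ID in the first column, as the `cut -f 1` call further down in this notebook implies. `read_sample_ids` and `duplicated_ids` are hypothetical helpers, not part of metaBEAT.

```python
from collections import Counter


def read_sample_ids(querymap_path):
    """Return the first tab-separated field of every non-empty line."""
    sample_ids = []
    with open(querymap_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            sample_ids.append(line.split("\t")[0])
    return sample_ids


def duplicated_ids(sample_ids):
    """Return, sorted, the IDs that occur more than once."""
    counts = Counter(sample_ids)
    return sorted(sid for sid, n in counts.items() if n > 1)
```

Calling `duplicated_ids(read_sample_ids("Querymap.txt"))` should return an empty list if every sample ID is unique.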

We prepare the list for the reference database.

In [None]:
%%bash

for gb in $(ls -1 ../1-NCBI_references/*.gb | grep "refseqs" -v)
do
    echo -e "$gb\tgb"
done > REFlist.txt

In [None]:
!cat REFlist.txt

### Trimming, merging and clustering

In [None]:
!metaBEAT_global.py --version

In [None]:
!metaBEAT_global.py --help

We run metaBEAT to quality-trim and merge the reads from the metabarcoding data.

In [None]:
!metaBEAT_global.py \
-Q Querymap.txt \
--trim_qual 30 \
--length_filter 313 \
--product_length 313 \
--merge --merged_only \
-m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk > metaBEAT_trim.log

In [None]:
!head -n 100 metaBEAT_trim.log

In [None]:
!tail -n 100 metaBEAT_trim.log

### Chimera detection

We then run the chimera detection step using the `.gb` files we generated before.

In [None]:
pwd

In [None]:
!mkdir chimera_detection

In [None]:
cd chimera_detection

In [None]:
!ls ../../1-NCBI_references/

In [None]:
%%bash

#Write REFmap
for file in $(ls ../../1-NCBI_references/* | grep "NCBI_2018.gb$")
do
    echo -e "$file\tgb"
done > chimera_REFmap.txt

In [None]:
!cat chimera_REFmap.txt

In [None]:
%%bash

metaBEAT_global.py \
-R chimera_REFmap.txt \
-f \
-@ M.Benucci@2015.hull.ac.uk

In [None]:
!head refs.fasta

In [None]:
%%bash


for a in $(cut -f 1 ../Querymap.txt)
do
    if [ -s ../$a/$a\_trimmed.fasta ]
    then
        echo -e "\n### Detecting chimeras in $a ###\n"
        mkdir $a
        cd $a
        vsearch --uchime_ref ../../$a/$a\_trimmed.fasta --db ../refs.fasta \
        --nonchimeras $a-nonchimeras.fasta --chimeras $a-chimeras.fasta &> log 
        cd ..

    else
        echo -e "$a is empty"
    fi
done
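
To see how many sequences were flagged in each sample, the output files can be counted. This is a minimal sketch that assumes the directory layout produced by the loop above (`<sample>/<sample>-nonchimeras.fasta` and `<sample>/<sample>-chimeras.fasta`) and that it is run from inside `chimera_detection`. `count_fasta_records` and `summarise_chimera_check` are hypothetical helpers.

```python
import os


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarise_chimera_check(sample, base_dir="."):
    """Return (n_nonchimeras, n_chimeras) for one sample, or None if an output file is missing."""
    nonchim = os.path.join(base_dir, sample, "{0}-nonchimeras.fasta".format(sample))
    chim = os.path.join(base_dir, sample, "{0}-chimeras.fasta".format(sample))
    if not (os.path.exists(nonchim) and os.path.exists(chim)):
        return None
    return count_fasta_records(nonchim), count_fasta_records(chim)
```

Samples for which the function returns `None` were either skipped because their trimmed file was empty, or `vsearch` did not write both output files.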

### Blast of non-chimera sequences

In [None]:
cd ..

In [None]:
%%bash

#Querymap
for a in $(ls -l chimera_detection/ | grep "^d" | perl -ne 'chomp; @a=split(" "); print "$a[-1]\n"')
do
    echo -e "$a-nc\tfasta\tchimera_detection/$a/$a-nonchimeras.fasta"
done > nonchimera_Querymap.txt

In [None]:
!cat nonchimera_Querymap.txt

In [None]:
!sed '/GLOBAL/d' nonchimera_Querymap.txt > nonchimera_Querymap_final.txt

In [None]:
!cat nonchimera_Querymap_final.txt

We now run the BLAST search of the non-chimeric sequences against the reference sequences that we downloaded previously and listed in `REFlist.txt`.
Sequences were clustered at 97% similarity, retaining only clusters that contain at least 6 sequences. For taxonomic assignment we required a minimum identity of 97% and a minimum alignment length of 85%.

In [None]:
!metaBEAT_global.py \
-Q nonchimera_Querymap_final.txt \
-R REFlist.txt \
--cluster --clust_match 0.97 --clust_cov 6 \
--blast --min_ident 0.97 --min_ali_length 0.85 \
-m COI -n 5 -v -@ m.benucci@2015.hull.ac.uk \
-o COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref > metaBEAT.log
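
To make the two BLAST thresholds in the command above concrete, the sketch below checks a single hit against them. It assumes that the alignment-length cut-off is expressed as a proportion of the query length. `passes_thresholds` is a hypothetical function for illustration and is not how metaBEAT is necessarily implemented.

```python
def passes_thresholds(identity, alignment_length, query_length,
                      min_ident=0.97, min_ali_length=0.85):
    """Return True if a hit meets both the identity and the alignment-length cut-off.

    identity is a proportion between 0 and 1. The alignment length is compared
    with the query length, which is an assumption about how the cut-off is applied.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    coverage = float(alignment_length) / query_length
    return identity >= min_ident and coverage >= min_ali_length


# Illustrative values only, for a 313 bp query
print(passes_thresholds(0.98, 300, 313))  # True: 300/313 is about 0.96
print(passes_thresholds(0.98, 250, 313))  # False: 250/313 is about 0.80
print(passes_thresholds(0.95, 313, 313))  # False: identity is below 0.97
```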

In [None]:
!head -n 100 metaBEAT.log

In [None]:
!tail -n 50 metaBEAT.log

The final OTU table can be found in the file: 

`GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-by-taxonomy-readcounts.blast.tsv`.

### Extracting the sequences of interest

In [None]:
import metaBEAT_global_misc_functions as mb

Identify samples which contained reads assigned to _C. floridanus_ before filtering.

In [None]:
mb.find_target(BIOM=mb.load_BIOM('GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-by-taxonomy-readcounts.blast.biom'), 
               target='Crangonyx_floridanus')

The OTU-by-taxonomy table `GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-OTU-taxonomy.blast.tsv` shows that OTUs assigned to _Crangonyx floridanus_ (_Cf_) and _Crangonyx pseudogracilis_ (_Cp_) contain 87.9% of all reads. Two OTUs alone account for 87.5% of all reads: one assigned to _Cf_ with 51.6% and one assigned to _Cp_ with 35.9%.
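
Percentages of this kind can be derived from per-OTU read totals. The sketch below shows the arithmetic on toy counts that are not taken from this study. `read_shares` is a hypothetical helper, and the per-OTU totals are assumed to have been summed across samples beforehand.

```python
def read_shares(otu_counts):
    """Return each OTU's share of all reads as a percentage, largest first."""
    total = sum(otu_counts.values())
    if total == 0:
        return []
    shares = [(otu, 100.0 * n / total) for otu, n in otu_counts.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Toy totals per OTU, not taken from the study
toy = {"OTU_1": 500, "OTU_2": 300, "OTU_3": 200}
for otu, pct in read_shares(toy):
    print("{0}\t{1:.1f}%".format(otu, pct))
```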

We filter the raw OTU table: within each sample, we remove OTUs that are supported by fewer than 2% of that sample's reads.

In [None]:
#load raw OTU table
to_filter = mb.load_BIOM(table='GLOBAL/BLAST_0.97/COI_28062019_merged-only_nonchimera_cl97cov6_blast_min97_ali0.85_ref-OTU-taxonomy.blast.biom')

#filter at 2% (min_prop=0.02)
filtered = mb.filter_BIOM_by_per_sample_read_prop(BIOM=to_filter, min_prop=0.02)

#write to file
mb.write_BIOM(filtered, target_prefix='filtered' )

#collapse OTUs by taxonomy
filtered_collapsed = mb.collapse_biom_by_taxonomy(in_table=filtered)

#write to file
mb.write_BIOM(filtered_collapsed, target_prefix='filtered-collapsed')
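
The per-sample filter applied above can be sketched in plain Python. This is not the code behind `mb.filter_BIOM_by_per_sample_read_prop`, which works on BIOM tables. It is a simplified illustration on nested dictionaries, and treating the threshold as inclusive is an assumption.

```python
def filter_by_per_sample_read_prop(table, min_prop=0.02):
    """Keep, per sample, only OTUs holding at least min_prop of that sample's reads.

    table maps sample -> {otu: read_count}.
    """
    filtered = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        kept = {}
        for otu, n in counts.items():
            # the proportion is computed within the sample, not across the whole table
            if total > 0 and float(n) / total >= min_prop:
                kept[otu] = n
        filtered[sample] = kept
    return filtered


# Toy counts for illustration, not taken from the study
toy_table = {
    "S1": {"OTU_1": 980, "OTU_2": 15, "OTU_3": 5},
    "S2": {"OTU_1": 50, "OTU_2": 50},
}
print(filter_by_per_sample_read_prop(toy_table))
```

In the toy table, `OTU_2` is removed from `S1` (1.5% of that sample's reads) but kept in `S2` (50%), because the threshold is evaluated separately for each sample.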

We identify the samples containing sequences assigned to _C. floridanus_ after filtering.

In [None]:
mb.find_target(filtered_collapsed, target='Crangonyx_floridanus')

And we identify the samples containing sequences assigned to _C. pseudogracilis_.

In [None]:
mb.find_target(filtered_collapsed, target='Crangonyx_pseudogracilis')