Contacts
- Gilles Vergnaud (gilles.vergnaud@u-psud.fr)
- Jean-Philippe Vernadet (jean-philippe.vernadet@laposte.net)
- Kraken2
# clone or download files
git clone https://github.com/DerrickWood/kraken2.git
cd kraken2
# launch install script ($KRAKEN2_DIR is the directory where Kraken2 will be installed)
./install_kraken2.sh "$KRAKEN2_DIR"
See the Kraken2 manual for more details.
- Conda
# get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# run the installer through bash (the downloaded script is not executable)
bash Miniconda3-latest-Linux-x86_64.sh
- b2forensics environment and scripts
# clone or download files
git clone https://github.com/i2bc/b2forensics.git
The first part of the pipeline requires a Kraken2 database (standard or custom), a file of tRNA and ribosomal RNA subunit sequences, and reference genome files for the species of interest.
- Kraken2 database
# standard database
kraken2-build --standard --db $DBNAME
As described in the Kraken2 manual, you can also build a custom database; see that manual or the How to use section.
- BLAST database
The pipeline needs a local BLAST database. Remember to set the BLAST database path in the config file. As explained on the "Get NCBI BLAST databases" page, you can download a preformatted NCBI BLAST database with the "update_blastdb.pl" script from the BLAST+ package.
- tRNA sequences/subunit ribosomal sequences
tRNA sequences come from tRNAdb, and ribosomal RNA subunit sequences come from the SILVA database. Concatenate these files and put the result in the tRNA_sequences directory.
- reference genomes
Download the genome FASTA files of the species of interest and put them in the reference_genomes directory.
- taxonomic IDs
For each species of interest, make a file listing all the taxonomic IDs you would like to include (one per line), and put it in the taxonomy_files directory.
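A malformed taxonomic ID file is easy to miss, so it can help to check the "one ID per line" format before running the pipeline. The following is a minimal sketch, not part of b2forensics: the function name check_taxid_file is hypothetical, and it assumes that every line must consist of digits only (no blank lines, comments or Windows line endings).

```shell
# Hypothetical helper (not shipped with b2forensics): verify that a taxonomy
# file contains exactly one numeric taxonomic ID per line.
# Offending lines are printed to stderr with their line number.
check_taxid_file() {
    local file="$1"
    # the file must exist and be non-empty
    [ -s "$file" ] || { echo "empty or missing file: $file" >&2; return 1; }
    # grep -v selects the lines that are NOT made of digits only;
    # it succeeds when at least one such line exists
    if grep -nEv '^[0-9]+$' "$file" >&2; then
        return 1
    fi
    return 0
}

# Example call (commented out so the block has no side effects):
# check_taxid_file taxonomy_files/taxonomy_tree_anthracis.txt && echo "format OK"
```

The check only validates the format; it does not verify that the IDs exist in the NCBI taxonomy.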
The data directory contains a dataset for testing the pipeline. In this example, the pipeline searches for sequences that can be assigned to one of the taxids (Bacillus anthracis and its sub-taxids) listed in the file taxonomy_files/taxonomy_tree_anthracis.txt.
- BLAST database
Use the "update_blastdb.pl" script from the BLAST+ package.
- tRNA sequences/subunit ribosomal sequences
Get the Bacteria tRNA gene sequences from tRNAdb with this request. Get SILVA_128_LSURef_tax_silva.fasta.gz and SILVA_128_SSURef_Nr99_tax_silva.fasta.gz from the SILVA database. Concatenate these files (unzipped) into a single file named "tRNA_bacteria_with_silva-128_lsu_ssu.fa" in the tRNA_sequences directory.
- Kraken2 database
To create the Kraken2 database, download the FASTA files of the "Complete Genome" assemblies for bacteria, archaea, fungi, protozoa and viruses. The scripts from Mick Watson can be used for this:
# clone the git repo
git clone https://github.com/mw55309/Kraken_db_install_scripts.git
Then proceed as explained in the opiniomics post, with adjustments for Kraken2:
# run for each branch of life you wish to download
perl download_bacteria.pl
perl download_archaea.pl
perl download_fungi.pl
perl download_protozoa.pl
perl download_viral.pl
# build a new database
# download taxonomy
kraken2-build --download-taxonomy --db kraken2_db
# for each branch, add all fna in the directory to the database
for dir in fungi protozoa archaea viral bacteria; do
    # glob directly instead of parsing ls, and quote paths so unusual file names are handled
    for fna in "$dir"/*.fna; do
        kraken2-build --add-to-library "$fna" --db kraken2_db
    done
done
# build the database
kraken2-build --build --db kraken2_db
- Fill in the different paths in the config file.
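The concatenation of the tRNA and SILVA files described in the list above can be sketched as follows. This is only an illustration under stated assumptions: concat_fasta is a hypothetical helper, and tRNAdb_bacteria.fa is a placeholder name for the file downloaded from tRNAdb (the actual name depends on how you saved it).

```shell
# Hypothetical helper: concatenate plain or gzipped FASTA files into one file.
# Usage: concat_fasta OUTPUT INPUT...
concat_fasta() {
    local out="$1" f
    shift
    # start from an empty output file
    : > "$out" || return 1
    for f in "$@"; do
        case "$f" in
            # gzipped inputs are decompressed on the fly, the originals are kept
            *.gz) gzip -dc "$f" >> "$out" || return 1 ;;
            *)    cat "$f" >> "$out" || return 1 ;;
        esac
    done
}

# Example call (commented out; tRNAdb_bacteria.fa is an assumed file name):
# concat_fasta tRNA_sequences/tRNA_bacteria_with_silva-128_lsu_ssu.fa \
#     tRNAdb_bacteria.fa \
#     SILVA_128_LSURef_tax_silva.fasta.gz \
#     SILVA_128_SSURef_Nr99_tax_silva.fasta.gz
```

Counting the header lines afterwards (grep -c '^>' on the output file) is a quick way to confirm that all three inputs were included.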
b2forensics_results
├── alignment_fastq
|   ├── {sample}_{strain}_R1.fq.gz
|   ├── {sample}_{strain}_R2.fq.gz
├── alignment_reads_id
|   ├── {sample}_alignment_paired_reads_id_{strain}.txt
├── blast_reads_id
|   ├── {sample}_{strain}_blast_output_uniq.txt
├── kraken_fasta
|   ├── {sample}_{strain}_R1.fa
|   ├── {sample}_{strain}_R2.fa
├── kraken_fastq
|   ├── {sample}_{strain}_R1.fq.gz
|   ├── {sample}_{strain}_R2.fq.gz
├── kraken_reads_id
|   ├── {sample}_kraken_paired_reads_id_{strain}.txt
├── kraken_results
|   ├── {sample}_cdb_paired.txt
├── megablast_results
|   ├── {sample}_{strain}_blast_output_R1.txt
|   ├── {sample}_{strain}_blast_output_R2.txt
|   ├── {sample}_{strain}_blast_output_R1_filtered.txt
|   ├── {sample}_{strain}_blast_output_R2_filtered.txt
├── trDNA_depleted
    ├── blast_alignment_{strain}
        ├── {sample}_aln_paired_trDNA_depleted_{strain}.sorted.bam
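
After a run, the layout listed above can be used to check that the expected files were produced for a given sample and strain. The sketch below is an assumption-laden illustration, not part of the pipeline: the function names are hypothetical, each directory is assumed to sit directly under the results directory, all intermediate files are assumed to be kept, and the trDNA_depleted entry is left out because its exact nesting is not certain from the listing.

```shell
# Hypothetical helper: print the per-sample output paths taken from the listing.
# Usage: expected_outputs RESULTS_DIR SAMPLE STRAIN
expected_outputs() {
    local root="$1" sample="$2" strain="$3" r
    # files that exist once per read direction
    for r in R1 R2; do
        printf '%s\n' \
            "$root/alignment_fastq/${sample}_${strain}_${r}.fq.gz" \
            "$root/kraken_fasta/${sample}_${strain}_${r}.fa" \
            "$root/kraken_fastq/${sample}_${strain}_${r}.fq.gz" \
            "$root/megablast_results/${sample}_${strain}_blast_output_${r}.txt" \
            "$root/megablast_results/${sample}_${strain}_blast_output_${r}_filtered.txt"
    done
    # files that exist once per sample/strain pair
    printf '%s\n' \
        "$root/alignment_reads_id/${sample}_alignment_paired_reads_id_${strain}.txt" \
        "$root/blast_reads_id/${sample}_${strain}_blast_output_uniq.txt" \
        "$root/kraken_reads_id/${sample}_kraken_paired_reads_id_${strain}.txt" \
        "$root/kraken_results/${sample}_cdb_paired.txt"
}

# Print every expected file that is missing; return non-zero if any is missing.
# Usage: report_missing_outputs RESULTS_DIR SAMPLE STRAIN
report_missing_outputs() {
    local missing=0 path
    # a here-document keeps the loop in the current shell, so $missing survives
    while IFS= read -r path; do
        [ -e "$path" ] || { echo "missing: $path"; missing=1; }
    done <<EOF
$(expected_outputs "$@")
EOF
    return "$missing"
}

# Example call (commented out; sample1 and anthracis are placeholder values):
# report_missing_outputs b2forensics_results sample1 anthracis
```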