# Classifying taxonomy of COL_032024 seq data

Using GTDB-Tk: https://github.com/Ecogenomics/GTDBTk?tab=readme-ov-file

GTDB-Tk identifies a core set of genes in each MAG and uses these to compute a species phylogeny.

**The GTDB reference database needs 100 GB of space, so do not create the conda env in the /home directory. Create the env in /work or /scratch instead.**
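
Before creating the env, it can help to confirm that the target filesystem actually has room. The sketch below is a hypothetical helper (`has_free_gb` is not part of any tool used in these notes) that compares the free space reported by `df` against a threshold. On clusters that enforce quotas, `df` may not reflect your personal limit, so treat this as a rough check only.

```shell
# Hypothetical helper: succeed only if DIR has at least NEED_GB gigabytes free.
# Assumes POSIX `df -Pk` output, where column 4 of line 2 is the available space in KB.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Example: check the current directory against the 100 GB the database needs
if has_free_gb . 100; then
    echo "enough space here"
else
    echo "less than 100 GB free here; pick another location"
fi
```

Run it from (or point it at) the directory where the env and database will live.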

I used this opportunity to create a /scratch directory with this documentation: https://docs.unity.rc.umass.edu/documentation/managing-files/hpc-workspace/

In [None]:
# Create a scratch workspace named "gtdb" for 30 days
ws_allocate gtdb 30

# FOR LATER: release the scratch space when you no longer need it
ws_release gtdb

In [None]:
# INSTALLATION of gtdbtk in the /scratch workspace
cd /scratch3/workspace/nikea_ulrich_uml_edu-gtdb
module load conda/latest
mkdir -p /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs
conda create --prefix /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/taxonomy python=3.8
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/taxonomy
conda install -c conda-forge -c bioconda gtdbtk=2.4.0 

In [None]:
# The conda package is bundled with a script that automatically downloads and extracts the GTDB-Tk reference data. Simply run:
download-db.sh
# This will take a while; the database is massive

https://ecogenomics.github.io/GTDBTk/commands/classify_wf.html
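
`classify_wf` only picks up files in `--genome_dir` whose extension matches `-x` (the job script in this notebook passes `-x fa`), so a quick count before submitting can catch an empty or misnamed bin directory. A minimal sketch; `count_bins` is a hypothetical helper name, not part of GTDB-Tk:

```shell
# Hypothetical helper: count files in DIR (non-recursive) whose names end in .EXT
count_bins() {
    local dir="$1" ext="${2:-fa}"
    find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d ' '
}

# Example with a throwaway directory; replace it with your DAS Tool bin directory
demo_dir=$(mktemp -d)
touch "$demo_dir/bin1.fa" "$demo_dir/bin2.fa" "$demo_dir/notes.txt"
count_bins "$demo_dir" fa   # prints 2
rm -r "$demo_dir"
```

If the count is 0 for the real bin directory, check the extension passed to `-x` before spending queue time on the job.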

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=150G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 48:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/taxonomy/mcav/slurm-gtdb_taxonomy-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/taxonomy

# Set parameters
SAMPLENAME="mcav"
BINPATH="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/binning/${SAMPLENAME}/${SAMPLENAME}_DASTool_bins"
OUTDIR="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/taxonomy/${SAMPLENAME}"
mkdir -p "$OUTDIR"

# Run GTDB-Tk
# Note: gtdbtk defaults to 1 CPU; pass --cpus to use the cores requested above
gtdbtk classify_wf -x fa --skip_ani_screen \
--genome_dir "$BINPATH/" --out_dir "$OUTDIR/gtdb_out"

# This processes all genomes in $BINPATH using both the bacterial and archaeal marker sets and writes the results to $OUTDIR/gtdb_out.
# Genomes must be in FASTA format (gzipped files with the .gz extension are accepted)

# JOB-ID:
# bash script file name: /nikea/COL/bash_scripts/Col_gtdb_taxonomy.sh

dlab job ID: 27123105 \
mcav job ID: 27159595 \
ofav job ID: 27159884 \
pstr job ID: 27159918
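
Once a job finishes, the classifications can be skimmed from the `gtdbtk.bac120.summary.tsv` file that `classify_wf` writes. The sketch below tallies genomes per phylum. `phylum_counts` is a hypothetical helper, and it assumes a tab-separated file with a header row that contains a `classification` column holding strings like `d__Bacteria;p__...;c__...`. Rows without a phylum (for example unclassified genomes) are tallied under their raw classification value.

```shell
# Hypothetical helper: tally genomes per phylum from a GTDB-Tk summary TSV.
# Assumes a header row with a "classification" column (located by name, not position).
phylum_counts() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "classification") col = i; next }
        col     { split($col, rank, ";")
                  phylum = (rank[2] != "" ? rank[2] : $col)   # fall back to the raw value
                  n[phylum]++ }
        END     { for (p in n) print n[p] "\t" p }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Possible use (the path is an assumption based on the OUTDIR set in the job script):
# phylum_counts "$OUTDIR/gtdb_out/gtdbtk.bac120.summary.tsv"
```

The output is one line per phylum, count first, sorted from most to least common.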

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=100G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/taxonomy/mcav/slurm-itol-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/taxonomy

SAMPLENAME="mcav"
WORKDIR="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/taxonomy/${SAMPLENAME}/gtdb_out/classify"
OUTDIR="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/taxonomy/${SAMPLENAME}/gtbd_trees_for_vis"
mkdir -p "$OUTDIR"

# Convert all classify trees to iTOL-readable format
gtdbtk convert_to_itol --input "$WORKDIR/gtdbtk.bac120.classify.tree.4.tree" --output "$OUTDIR/itol.gtdbtk.classify.tree.4.tree"

gtdbtk convert_to_itol --input "$WORKDIR/gtdbtk.bac120.classify.tree.6.tree" --output "$OUTDIR/itol.gtdbtk.classify.tree.6.tree"

gtdbtk convert_to_itol --input "$WORKDIR/gtdbtk.bac120.classify.tree.8.tree" --output "$OUTDIR/itol.gtdbtk.classify.tree.8.tree"

gtdbtk convert_to_itol --input "$WORKDIR/gtdbtk.backbone.bac120.classify.tree" --output "$OUTDIR/itol.gtdbtk.backbone.classify.tree"

# JOB-ID:27211859
# bash script file name: /nikea/COL/bash_scripts/Col_gtdb2itol_taxonomy.sh

*The script above could be included in the initial gtdbtk script next time.*
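
One way to fold the conversion into the classify job is a small loop over whatever classify trees exist, rather than one command per tree. This is only a sketch: `convert_trees_to_itol` is a hypothetical helper, it reuses the `--input`/`--output` flags from the script above, and it names each output `itol.` plus the original file name, which differs slightly from the names used above.

```shell
# Sketch: convert every classify tree in CLASSIFY_DIR to iTOL format.
# Hypothetical helper; relies on `gtdbtk` being available in the active environment.
convert_trees_to_itol() {
    local classify_dir="$1" out_dir="$2" tree
    mkdir -p "$out_dir"
    for tree in "$classify_dir"/gtdbtk.*.classify.tree*; do
        [ -e "$tree" ] || continue   # skip if the glob matched nothing
        gtdbtk convert_to_itol --input "$tree" --output "$out_dir/itol.$(basename "$tree")"
    done
}

# Possible use at the end of the classify_wf script:
# convert_trees_to_itol "$OUTDIR/gtdb_out/classify" "$OUTDIR/gtbd_trees_for_vis"
```

Because the loop globs for tree files, it does not need to know in advance which split trees (4, 6, 8, ...) a given sample produced.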

In [None]:
# TODO: add the iqtree script for the MSA here