## Functional annotation of reads

Using HUMAnN 3.0 (the HMP Unified Metabolic Analysis Network), a method for efficiently and accurately profiling the abundance of microbial metabolic pathways and other molecular functions from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community.

https://github.com/biobakery/biobakery/wiki/humann3#22-running-humann-the-basics

resource: https://huttenhower.sph.harvard.edu/humann/

In [None]:
#INSTALLATION
conda create --prefix /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3 python=3.7
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3
#set conda channel priority
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --add channels biobakery
conda install humann -c biobakery

In [None]:
#download databases to do demo run
humann_databases --download chocophlan DEMO humann_dbs
humann_databases --download uniref DEMO_diamond humann_dbs
#downloaded demo_seq file from: https://github.com/biobakery/biobakery/wiki/humann3
#run test with demo data to make sure it works
#run this in a bash script
humann -i demo.fastq.gz -o sample_results

In [None]:
#update databases
humann_databases --download utility_mapping full /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs --update-config yes
humann_databases --download chocophlan full /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs --update-config yes
humann_databases --download uniref uniref90_diamond /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs --update-config yes
humann_databases --download uniref uniref50_diamond /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs --update-config yes


https://github.com/biobakery/humann?tab=readme-ov-file#main-workflow

https://github.com/biobakery/biobakery/wiki/humann3#22-running-humann-the-basics

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/humann_prof/pstr/slurm-humann-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3

# Set parameters
#SAMPLENAME="pstr"
#SAMPLELIST="032024_pstr_sampleids.txt" 
#LISTPATH="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/"
DB="/scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs"
READS="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/assembly/pstr/repaired"
INPUT="/scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/pstr"
#mkdir -p $INPUT
OUT="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/humann_prof/pstr"
#mkdir -p $OUT

#concatenate F and R
#End-pairing relationships are currently not taken into account during HUMAnN's alignment steps. The best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file.
#Will concatenate the F and R reads for one sample to try.  
cat $READS/032024_COL_SAN_T5_144_PSTR_S9_host_removed_R1.tagged_filter_ready.fastq.gz $READS/032024_COL_SAN_T5_144_PSTR_S9_host_removed_R2.tagged_filter_ready.fastq.gz > $INPUT/032024_COL_SAN_T5_144_PSTR_S9_filter_ready_all.fastq.gz

#run on quality filtered and repaired reads that are concatenated
humann --input $INPUT/032024_COL_SAN_T5_144_PSTR_S9_filter_ready_all.fastq.gz --output $OUT/ 

conda deactivate

# JOB-ID: 29032744
# bash script file name: nikea/COL/bash_scripts/Col_humann_profiling.sh

notes: \
From humann manual: The best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file. Will use the concatenated seqs for each coral species first. \
However, another option could be to interleave the reads. resource: https://www.biostars.org/p/9469504/
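
Concatenating the two .fastq.gz files directly with `cat` works because a gzip file may contain several compressed members back to back. A quick sanity check on toy data (the file names and reads below are made up for illustration, not the real samples):

```shell
# Toy data in a temporary directory (hypothetical file names)
TMP=$(mktemp -d)
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > "$TMP/toy_R1.fastq.gz"
printf '@read1/2\nTGCA\n+\nIIII\n' | gzip > "$TMP/toy_R2.fastq.gz"

# Concatenate the compressed files without decompressing them first
cat "$TMP/toy_R1.fastq.gz" "$TMP/toy_R2.fastq.gz" > "$TMP/toy_all.fastq.gz"

# Test the integrity of the combined file, then count reads (4 lines per FASTQ record)
gzip -t "$TMP/toy_all.fastq.gz"
N_READS=$(( $(gzip -dc "$TMP/toy_all.fastq.gz" | wc -l) / 4 ))
echo "reads in concatenated file: $N_READS"
```

Running the same `gzip -t` and read count on a real concatenated file before submitting the HUMAnN job should catch a truncated or corrupted input early.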

Ran bash script from /scratch folder but had humann running in work. Temp files are large so will change to run everything in /scratch.

ran on one sample first to test, but might need to run on all samples concatenated together depending on the output (so far looks like metaphlan was unable to identify taxa in this one sample)
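
A rough way to see whether MetaPhlAn found anything is to count the species-level rows in each bugs list. This is only a sketch on toy files: it assumes the bugs list has `#` comment lines followed by tab-separated rows whose first column is the clade lineage, and the sample names are hypothetical.

```shell
TMP=$(mktemp -d)

# Toy profile with no classified taxa (assumed layout)
printf '#mpa_toy\n#clade_name\tNCBI_tax_id\trelative_abundance\nUNKNOWN\t-1\t100.0\n' \
  > "$TMP/sampleA_metaphlan_bugs_list.tsv"

# Toy profile with one species-level clade
printf '#mpa_toy\n#clade_name\tNCBI_tax_id\trelative_abundance\nk__Bacteria\t2\t100.0\nk__Bacteria|p__P|c__C|o__O|f__F|g__G|s__G_toy\t2|0\t100.0\n' \
  > "$TMP/sampleB_metaphlan_bugs_list.tsv"

# Count rows whose lineage reaches species (s__) but not strain (t__) level
count_species() {
  awk -F'\t' '!/^#/ && $1 ~ /s__/ && $1 !~ /t__/ { n++ } END { print n + 0 }' "$1"
}

for f in "$TMP"/*_metaphlan_bugs_list.tsv; do
  echo "$(basename "$f"): $(count_species "$f") species-level clades"
done

A_COUNT=$(count_species "$TMP/sampleA_metaphlan_bugs_list.tsv")
B_COUNT=$(count_species "$TMP/sampleB_metaphlan_bugs_list.tsv")
```

A count of zero for a real sample would match what was seen here: no taxa identified, so all reads go to the unclassified tier.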


**Let's try all ofav samples and see if that helps!** this is following a meeting where it was decided that we should focus on ofav first

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/ofav/take2/slurm-humann-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3

# Set parameters
#PATH="/scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/ofav"

#concatenate F and R
#End-pairing relationships are currently not taken into account during HUMAnN's alignment steps. The best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file.
#Will concatenate the F and R reads of all samples.
#cat $READS/ofav_reads_R1_ALL.fastq.gz $READS/ofav_reads_R2_ALL.fastq.gz > $PATH/ofav_reads_ALL.fastq.gz

#run humann
humann --input humann_prof/ofav/ofav_reads_ALL.fastq.gz --output humann_prof/ofav/take2 --threads 11

conda deactivate

# JOB-ID: 29185909, take 2(29227386)
# bash script file name: nikea/COL/bash_scripts/Col_humann_profiling_ofav.sh

**downloaded the uniref50 database and re-ran, hopefully this will help annotate the gene families!**

looking at data, using this resource: https://github.com/biobakery/biobakery/wiki/humann3 \
https://github.com/biobakery/humann?tab=readme-ov-file#humann_barplot

In [None]:
salloc -p cpu --mem 150G
#salloc: Granted job allocation 29241322
#salloc: Nodes uri-cpu001 are ready for job

In [None]:
module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3
#let's normalize to copies per million
humann_renorm_table --input humann_prof/ofav/take2/ofav_reads_ALL_genefamilies.tsv --output humann_prof/ofav/take2/ofav_reads_ALL_genefamilies_cpm.tsv --units cpm --update-snames
cd humann_prof/ofav/take2/
humann_regroup_table --input ofav_reads_ALL_genefamilies_cpm.tsv --output rxn-cpm.tsv --groups uniref90_rxn
#Original Feature Count: 13153; Grouped 1+ times: 465 (3.5%); Grouped 2+ times: 138 (1.0%)
#now let's attach some human-readable descriptions of these IDs to facilitate biological interpretation
humann_rename_table --input rxn-cpm.tsv --output rxn-cpm-named.tsv --names metacyc-rxn
#Renamed 554 of 556 entries (99.64%)
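
Since only a small share of gene families regrouped to reactions, it may help to see how much of the abundance ended up ungrouped. A minimal sketch on a toy table, assuming the regrouped table keeps an `UNGROUPED` row and marks stratified rows with `|` (the feature names and values are invented):

```shell
TMP=$(mktemp -d)

# Toy regrouped table: header, community-level rows, and stratified rows
printf '# Gene Family\ttoy_Abundance-CPM\nUNMAPPED\t200000\nUNGROUPED\t700000\nUNGROUPED|unclassified\t700000\nRXN-TOY1\t100000\nRXN-TOY1|unclassified\t100000\n' \
  > "$TMP/toy_rxn_cpm.tsv"

# Percent of community-level abundance assigned to UNGROUPED
UNGROUPED_PCT=$(awk -F'\t' '
  NR == 1 { next }                 # skip the header
  index($1, "|") > 0 { next }      # skip stratified rows
  { total += $2 }
  $1 == "UNGROUPED" { ungrouped += $2 }
  END { if (total > 0) printf "%.1f", 100 * ungrouped / total }
' "$TMP/toy_rxn_cpm.tsv")

echo "UNGROUPED share of community-level abundance: ${UNGROUPED_PCT}%"
```

For a real table, point the `awk` command at `rxn-cpm.tsv` instead of the toy file.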


let's try using the assembled contigs rather than the filtered fastq.gz to see if that yields better taxonomic results! **It doesn't so don't do this.**

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/ofav/ofav_contigs/slurm-humann-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3

# Set parameters
CONTIGSPATH="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/assembly/ofav/megahit_assembly"

#trying the ofav assembled contigs to see what difference it makes

#run humann
humann --input $CONTIGSPATH/ofav.contigs.fa --output humann_prof/ofav/ofav_contigs --threads 11

conda deactivate

# JOB-ID: 29250150
# bash script file name: nikea/COL/bash_scripts/Col_humann_profiling_ofav.sh

Concatenating each separate sample ID and running humann. Then will combine files at the end.

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 48:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/ofav/slurm-humann-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3

# Set parameters
SAMPLENAME="ofav"
SAMPLELIST="032024_${SAMPLENAME}_sampleids.txt" 
LISTPATH="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/"
#DB="/scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_dbs"
READS="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/assembly/${SAMPLENAME}/repaired"

#work from scratch3 directory (submit script from there)

#concatenate F and R
while IFS= read -r SAMPLEID; do
    cat "$READS/${SAMPLEID}_host_removed_R1.tagged_filter_ready.fastq.gz" "$READS/${SAMPLEID}_host_removed_R2.tagged_filter_ready.fastq.gz" > "humann_prof/ofav/${SAMPLEID}_filter_ready_all.fastq.gz"
    if [ $? -eq 0 ]; then
        echo "concatenation successful for sample: $SAMPLEID"
    else
        echo "encountered an error for sample: $SAMPLEID"
        exit 1
    fi
done < "$LISTPATH/${SAMPLELIST}"

#run humann on quality filtered and repaired reads that are concatenated
#run in scratch because the temp files are very big
cd humann_prof/"${SAMPLENAME}"

while IFS= read -r SAMPLEID; do
    humann --input "${SAMPLEID}_filter_ready_all.fastq.gz" --output "${SAMPLEID}_humann" --threads 11
    if [ $? -eq 0 ]; then
        echo "humann profile completed for sample: $SAMPLEID"
    else
        echo "humann encountered an error for sample: $SAMPLEID"
        exit 1
    fi
done < "$LISTPATH/${SAMPLELIST}"

conda deactivate

# JOB-ID: 29436670 (forgot to add more threads so first sample ran before I cancelled and re-ran)
# bash script file name: nikea/COL/bash_scripts/Col_humann_indiv.sh
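
Because the loop exits on the first failed concatenation, it can save a resubmission to check up front that every sample ID in the list has both read files. A sketch using a temporary directory and made-up sample IDs (`toyA`, `toyB`); only the file name pattern is taken from the script above:

```shell
TMP=$(mktemp -d)
READS="$TMP/repaired"            # stand-in for the real reads directory
mkdir -p "$READS"
printf 'toyA\ntoyB\n' > "$TMP/toy_sampleids.txt"

# toyA has both mates; toyB is missing its R2 file
touch "$READS/toyA_host_removed_R1.tagged_filter_ready.fastq.gz" \
      "$READS/toyA_host_removed_R2.tagged_filter_ready.fastq.gz" \
      "$READS/toyB_host_removed_R1.tagged_filter_ready.fastq.gz"

# Collect every expected file that does not exist
MISSING=""
while IFS= read -r SAMPLEID; do
  for MATE in R1 R2; do
    f="$READS/${SAMPLEID}_host_removed_${MATE}.tagged_filter_ready.fastq.gz"
    if [ ! -e "$f" ]; then
      echo "missing: $f"
      MISSING="$MISSING ${SAMPLEID}_${MATE}"
    fi
  done
done < "$TMP/toy_sampleids.txt"

echo "missing files:${MISSING:- none}"
```

On real data, swapping `-e` for `-s` would also flag files that exist but are empty.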

metaphlan was unable to classify taxa in each sample. For this I either should concatenate all samples or go the kraken route

In [None]:
#formatting the table, combine samples into one table!
#move output tables from /scratch to /work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/humann_prof/ofav

humann_join_tables --input ./ --output combined_tables_genefamilies.tsv

humann_renorm_table --input combined_tables_genefamilies.tsv --output combined_tables_genefamilies_cpm.tsv --units cpm --update-snames

humann_regroup_table --input combined_tables_genefamilies_cpm.tsv --output rxn-combined_cpm.tsv --groups uniref50_rxn

humann_rename_table --input rxn-combined_cpm.tsv --output rxn-combined_cpm_named.tsv --names metacyc-rxn 
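
A quick check that the CPM normalization behaved as expected is to sum the community-level rows of each sample column. This sketch assumes the table uses `|` to mark stratified rows and that community-level rows (including `UNMAPPED`) should total about one million per sample; the table below is a toy example with invented values:

```shell
TMP=$(mktemp -d)

# Toy joined CPM table with two samples
printf '# Gene Family\ts1_Abundance-CPM\ts2_Abundance-CPM\nUNMAPPED\t250000\t100000\nUniRef90_TOY1\t500000\t600000\nUniRef90_TOY1|unclassified\t500000\t600000\nUniRef90_TOY2\t250000\t300000\nUniRef90_TOY2|unclassified\t250000\t300000\n' \
  > "$TMP/toy_genefamilies_cpm.tsv"

# Per-sample sum over community-level rows only
SUMS=$(awk -F'\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
  index($1, "|") > 0 { next }
  { for (i = 2; i <= n; i++) sum[i] += $i }
  END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], sum[i] }
' "$TMP/toy_genefamilies_cpm.tsv")

echo "$SUMS"
```

A column whose sum is far from one million in a real table would suggest that files of different types were joined together or that the table was not renormalized.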

I think the next step is to concatenate all samples for each (mcav, dlab, pstr) and then compare across those

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/mcav/slurm-humann-%j.out  # %j = job ID

module load conda/latest
conda activate /scratch3/workspace/nikea_ulrich_uml_edu-gtdb/envs/biobakery3

# Set parameters
READS="/work/pi_sarah_gignouxwolfsohn_uml_edu/nikea/COL/assembly/mcav"
NEWPATH="/scratch3/workspace/nikea_ulrich_uml_edu-gtdb/humann_prof/mcav"
mkdir -p $NEWPATH

#concatenate F and R
#End-pairing relationships are currently not taken into account during HUMAnN's alignment steps. The best way to use paired-end sequencing data with HUMAnN is simply to concatenate all reads into a single FASTA or FASTQ file.
#Will concatenate the F and R reads of all samples.
cat $READS/mcav_reads_R1_ALL.fastq.gz $READS/mcav_reads_R2_ALL.fastq.gz > $NEWPATH/mcav_reads_ALL.fastq.gz

#run humann
humann --input "$NEWPATH/mcav_reads_ALL.fastq.gz" --output "$NEWPATH" --threads 11

conda deactivate

# JOB-ID: 29464798
# bash script file name: nikea/COL/bash_scripts/Col_humann_profiling_mcav.sh

ran this for dlab (29465661; needed 32 hrs to run) and pstr (29465821)
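
Since the same script was edited and resubmitted for each species, one option would be a single parameterized script submitted in a loop. The sketch below only prints the `sbatch` commands (a dry run). The script name `Col_humann_profiling_species.sh` and the `SPECIES` variable are hypothetical, and the longer time limit for dlab reflects the 32 hours noted above:

```shell
SCRIPT="Col_humann_profiling_species.sh"   # hypothetical parameterized script

# Build one submission command per species without actually submitting
DRY_RUN=$(
  for SPECIES in mcav dlab pstr; do
    TIME="24:00:00"
    [ "$SPECIES" = "dlab" ] && TIME="48:00:00"   # dlab needed about 32 hrs
    echo "sbatch --time=${TIME} --job-name=humann_${SPECIES} --export=ALL,SPECIES=${SPECIES} ${SCRIPT}"
  done
)

echo "$DRY_RUN"
```

Inside such a script, `READS` and `NEWPATH` would be built from `$SPECIES` instead of being hard-coded.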

In [None]:
#combined all species into one table genefamilies file (moved to same directory)
humann_join_tables -i ./ -o all_species_genefamilies.tsv

humann_renorm_table --input all_species_genefamilies.tsv --output all_species_genefamilies_cpm.tsv --units cpm --update-snames 

humann_regroup_table --input all_species_genefamilies_cpm.tsv --output all_species_rxn_cpm.tsv --groups uniref90_rxn

humann_rename_table --input all_species_rxn_cpm.tsv --output all_species_rxn_cpm_named.tsv --names metacyc-rxn 

#combine all species into one pathabundance file using the humann_join_tables command

#combine metaphlan bugs list for all 4 coral species
#https://github.com/biobakery/biobakery/wiki/metaphlan4
merge_metaphlan_tables.py dlab_reads_ALL_metaphlan_bugs_list.tsv mcav_reads_ALL_metaphlan_bugs_list.tsv ofav_reads_ALL_metaphlan_bugs_list.tsv pstr_reads_ALL_metaphlan_bugs_list.tsv > merged_metaphlan_abundance_table.txt

conda install -c biobakery hclust2

#grep "s__" alone drops the header line because the header contains no "s__"; also matching "clade_name" keeps it
#(assumes the header of the merged table starts with clade_name; check with head before running)
#this creates a species only abundance table (can adapt this for other taxonomic levels too)
grep -E "s__|clade_name" merged_metaphlan_abundance_table.txt \
| grep -v "t__" \
| sed "s/^.*|//g" \
> merged_species_abundance_table_species.txt

#generate the heatmap, doesn't work yet

hclust2.py \
-i merged_species_abundance_table_species.txt \
-o metaphlan_abundance_heatmap_species.png \
--f_dist_f braycurtis \
--s_dist_f braycurtis \
--cell_aspect_ratio 0.5 \
--flabel_size 10 --slabel_size 10 \
--max_flabel_len 100 --max_slabel_len 100 \
--minv 0 --maxv 100
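
One way to build the species-only table while keeping the header row is to let the first `grep` match the header as well. This is a sketch on a toy merged table: the `clade_name` header and the lineages are assumptions for illustration, so check the first lines of the real merged file before relying on it.

```shell
TMP=$(mktemp -d)

# Toy merged table: header, kingdom row, two species rows, one strain-level row
printf 'clade_name\tdlab\tmcav\nk__Bacteria\t100.0\t100.0\nk__Bacteria|p__P|c__C|o__O|f__F|g__G|s__G_toy1\t60.0\t40.0\nk__Bacteria|p__P|c__C|o__O|f__F|g__G|s__G_toy1|t__SGB1\t60.0\t40.0\nk__Bacteria|p__P|c__C|o__O|f__F|g__G|s__G_toy2\t40.0\t60.0\n' \
  > "$TMP/toy_merged.txt"

# Keep the header plus species rows, drop strain rows, strip the lineage prefix
grep -E "s__|clade_name" "$TMP/toy_merged.txt" \
| grep -v "t__" \
| sed "s/^.*|//g" \
> "$TMP/toy_species.txt"

cat "$TMP/toy_species.txt"
```

The header line contains no `|`, so the `sed` step leaves it unchanged and the sample names stay available as column labels.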

