# Assigning function to reads

This notebook will go through the workflow for assigning function to reads in a microbiome using HUMAnN 3.

-----------

Sections:

1. Assign function to reads using HUMAnN 3.
2. Summarize the HUMAnN 3 output for KEGG terms.

-----------


## Getting Started

Set the variables you need for running the analyses in this notebook.

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/16_function"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our fastq files with human contamination removed.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/07_contam_removal"
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs these variables, it will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export HU3_DB=/xdisk/bhurwitz/databases/Humann3" >> config.sh
!echo "export MPA_DB=/xdisk/bhurwitz/databases/Metaphlan" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
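
The same check can be done programmatically. The sketch below is illustrative only and not part of the assignment: `parse_config` and `find_placeholders` are hypothetical helpers, and they assume every line of the config file has the form `export NAME=value`, as written by the cell above. The file name `samples.txt` is a made-up example.

```python
def parse_config(text):
    """Parse lines of the form 'export NAME=value' into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines and anything that is not an export
        name, _, value = line[len("export "):].partition("=")
        settings[name] = value
    return settings


def find_placeholders(settings, placeholders=("MY_NETID", "MY_XFILE")):
    """Return the names of settings that still contain a placeholder value."""
    return sorted(
        name for name, value in settings.items()
        if any(p in value for p in placeholders)
    )


# Example config text in which the netid was never filled in (hypothetical values)
example = """export NETID=MY_NETID
export XFILE=samples.txt
export WORK_DIR=/xdisk/bhurwitz/bh_class/MY_NETID/assignments/16_function
"""
settings = parse_config(example)
print(find_placeholders(settings))
```

To check the real file, you could pass `open("config.sh").read()` to `parse_config`; an empty list means no placeholder values remain.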

## Step 1: Running HUMAnN 3 to get functional potential for the reads

In this step, we will compare all of our trimmed/screened reads to the functional databases using HUMAnN. 

HUMAnN (the HMP Unified Metabolic Analysis Network) is a computational tool for functional profiling of metagenomes. It is designed to help researchers analyze and interpret the functional potential of microbial communities from DNA sequencing data.

In [None]:
# Create a run script to run HUMAnN3
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                         
#SBATCH --output=16A_function-%a.out
#SBATCH --error=16A_function-%a.err
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=6G                                    

pwd; hostname; date
source $SLURM_SUBMIT_DIR/config.sh

#load environment
CONDA="/groups/bhurwitz/miniconda3"
source $CONDA/etc/profile.d/conda.sh
conda activate humann3_env 

names=($(cat $XFILE_DIR/$XFILE))
SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

ZPAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
gunzip ${ZPAIR1}
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq

#No PAIR2 needed
#PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

OUT_DIR="$WORK_DIR/out_humann3/${SAMPLE_ID}"

if [[ ! -d "$OUT_DIR" ]]; then
        mkdir -p $OUT_DIR
fi

echo ${PAIR1}

#run humann
humann --input ${PAIR1} --input-format fastq \
    -o ${OUT_DIR} \
    --metaphlan-options="-t rel_ab --bowtie2db ${MPA_DB} --index mpa_vJun23_CHOCOPhlAnSGB_202403" \
    --nucleotide-database ${HU3_DB}/chocophlan --search-mode uniref90 \
    --protein-database ${HU3_DB}/uniref --threads 28

gzip ${PAIR1}

'''

with open('16A_read_function.sh', mode='w') as file:
    file.write(my_code)
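
Each array task in the run script processes one sample: the xfile is read into a list, and `SLURM_ARRAY_TASK_ID` is used as a zero-based index into that list. This means `--array=0-7` assumes the xfile lists exactly eight sample IDs. Here is a minimal sketch of that mapping, using made-up sample IDs and hypothetical helper names (`read_sample_ids`, `sample_for_task`, `array_range`):

```python
def read_sample_ids(text):
    """Split xfile contents on whitespace, like names=($(cat xfile)) in bash."""
    return text.split()


def sample_for_task(sample_ids, task_id):
    """Return the sample ID handled by the array task with this index."""
    if not 0 <= task_id < len(sample_ids):
        raise IndexError(f"task {task_id} has no sample (only {len(sample_ids)} listed)")
    return sample_ids[task_id]


def array_range(sample_ids):
    """Build the --array value that covers every sample exactly once."""
    if not sample_ids:
        raise ValueError("the xfile lists no samples")
    return f"0-{len(sample_ids) - 1}"


# Made-up xfile contents, one sample ID per line
xfile_text = "SAMPLE_A\nSAMPLE_B\nSAMPLE_C\n"
ids = read_sample_ids(xfile_text)
print(sample_for_task(ids, 1))
print(array_range(ids))
```

Unlike this sketch, the bash script does not raise an error for an out-of-range index; `SAMPLE_ID` would simply be empty, so it is worth confirming that the array range matches the number of lines in your xfile.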

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the 16A_read_function.sh script?
!pwd
!ls

## Step 2: Summarizing the HUMAnN 3 output for KEGG terms

In [None]:
# Create a script to normalize, regroup, and merge the HUMAnN 3 output tables
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --output=16B_merge_humann3-%a.out
#SBATCH --error=16B_merge_humann3-%a.err
#SBATCH --array=0-7 
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=6G                                    

pwd; hostname; date
source $SLURM_SUBMIT_DIR/config.sh

names=($(cat $XFILE_DIR/$XFILE))
SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

#load environment
CONDA="/groups/bhurwitz/miniconda3"
source $CONDA/etc/profile.d/conda.sh
conda activate humann3_env 

IN_DIR="$WORK_DIR/out_humann3/${SAMPLE_ID}"
OUT_DIR="$WORK_DIR/out_pathabundance/${SAMPLE_ID}"

if [[ ! -d "$OUT_DIR" ]]; then
        mkdir -p $OUT_DIR
fi

cd $IN_DIR

# normalize path abundance
for f in *_pathabundance.tsv
do
    humann_renorm_table --input $f --output cpm_$f --units cpm
    mv cpm_$f $OUT_DIR
done

# merge table and stratify

# -p avoids an error if the directory already exists from an earlier run
mkdir -p result_tables
humann_join_tables --input $OUT_DIR --output humann_pathabundance.tsv --file_name cpm_
humann_split_stratified_table --input humann_pathabundance.tsv --output result_tables
mv humann_pathabundance.tsv result_tables

OUT_DIR="$WORK_DIR/out_geneabundance/${SAMPLE_ID}"

cd $IN_DIR

if [[ ! -d "$OUT_DIR" ]]; then
        mkdir -p $OUT_DIR
fi

# normalize gene abundance and group by KEGG
for f in *_genefamilies.tsv
do
    Kf="${f%%.tsv}KEGG.tsv"
    humann_regroup_table --input $f --output $Kf --custom /xdisk/bhurwitz/databases/Humann3/utility_mapping/map_ko_uniref90.txt.gz
    humann_renorm_table --input $Kf --output cpm_$Kf --units cpm
    mv cpm_$Kf $OUT_DIR
done

# merge table and stratify

humann_join_tables --input $OUT_DIR --output humann_KOabundance.tsv --file_name cpm_
humann_split_stratified_table --input humann_KOabundance.tsv --output result_tables
mv humann_KOabundance.tsv result_tables

# all without norm
# make sure the output directory exists before writing to it and moving files into it
mkdir -p result_tables_nonnorm
humann_join_tables --input . --output humann_KOnonnorm.tsv --file_name _genefamiliesKEGG.tsv
humann_split_stratified_table --input humann_KOnonnorm.tsv --output result_tables_nonnorm
mv humann_KOnonnorm.tsv result_tables_nonnorm

'''

with open('16B_merge_humann3.sh', mode='w') as file:
    file.write(my_code)
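
To make the merge script's two main operations concrete, here is a simplified sketch of the ideas behind `humann_renorm_table --units cpm` and `humann_split_stratified_table`. It is not the tools' actual implementation. It assumes abundances are scaled so that the community-level rows (feature names without `|`) sum to one million, and that stratified rows are those whose names contain `|`. The real tools also handle special rows such as UNMAPPED and offer other modes, none of which is reproduced here. The pathway names and values are made up.

```python
def renorm_cpm(rows):
    """Scale abundances so the community-level rows sum to one million."""
    total = sum(value for name, value in rows if "|" not in name)
    return [(name, value * 1_000_000 / total) for name, value in rows]


def split_stratified(rows):
    """Separate per-taxon rows (name contains '|') from community totals."""
    stratified = [row for row in rows if "|" in row[0]]
    unstratified = [row for row in rows if "|" not in row[0]]
    return stratified, unstratified


# Toy table: (feature name, abundance); names and values are invented
table = [
    ("PWY-A", 30.0),
    ("PWY-A|g__Bacteroides", 20.0),
    ("PWY-A|unclassified", 10.0),
    ("PWY-B", 10.0),
]
cpm = renorm_cpm(table)
stratified, unstratified = split_stratified(cpm)
print(unstratified)
print(stratified)
```

Normalizing before merging puts samples with different sequencing depths on a common scale, which is why the script keeps the non-normalized tables in a separate directory.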


In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 16A_read_function: first job - no dependencies
job1=$(sbatch 16A_read_function.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

# 16B_merge_humann3: jid2 depends on jid1
job2=$(sbatch --dependency=afterok:$jid1 16B_merge_humann3.sh)
jid2=$(echo $job2 | sed 's/^Submitted batch job //')
echo $jid2

'''

with open('16_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)
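
The launcher chains the two jobs by pulling the job ID out of the text that `sbatch` prints (`Submitted batch job <id>`) and passing it to `--dependency=afterok:<id>`, so the merge job starts only after the first job finishes successfully. The sketch below shows the same logic in Python without submitting anything. `parse_job_id` and `dependent_command` are hypothetical helpers, and the job ID `12345` is made up.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_output!r}")
    return match.group(1)


def dependent_command(job_id, script):
    """Build the sbatch command for a job that waits on job_id succeeding."""
    return ["sbatch", f"--dependency=afterok:{job_id}", script]


# Simulated sbatch output; nothing is submitted to the scheduler here
jid1 = parse_job_id("Submitted batch job 12345")
print(dependent_command(jid1, "16B_merge_humann3.sh"))
```

Raising an error when no job ID is found is a deliberate choice in this sketch: if the first submission fails, the second job is not submitted with an empty dependency.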

In [None]:
# Make all of the scripts executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./16_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

### Time to wait...

Great job! You kicked off a script to get functional annotation for your data. Now you need to wait for it to complete. It should take about an hour to run.
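
Once the jobs leave the queue, you may want to confirm that each sample produced its output tables. The sketch below assumes HUMAnN 3 writes three files per sample ending in `_genefamilies.tsv`, `_pathabundance.tsv`, and `_pathcoverage.tsv`. `missing_outputs` is a hypothetical helper, and the demo runs against a temporary directory with a made-up sample name rather than your real results.

```python
import tempfile
from pathlib import Path

# Suffixes of the per-sample tables we expect to find (an assumption, see above)
EXPECTED_SUFFIXES = ("_genefamilies.tsv", "_pathabundance.tsv", "_pathcoverage.tsv")


def missing_outputs(sample_dir):
    """Return the expected suffixes with no matching file in sample_dir."""
    sample_dir = Path(sample_dir)
    if not sample_dir.is_dir():
        return list(EXPECTED_SUFFIXES)  # nothing was written for this sample
    return [
        suffix for suffix in EXPECTED_SUFFIXES
        if not any(sample_dir.glob(f"*{suffix}"))
    ]


# Demo: a fake sample directory containing only one of the three tables
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "SAMPLE_A"
    sample_dir.mkdir()
    (sample_dir / "SAMPLE_A_1_genefamilies.tsv").touch()
    demo_missing = missing_outputs(sample_dir)

print(demo_missing)
```

For your own run, you could call `missing_outputs` on each sample directory under `out_humann3` in your working directory; an empty list means all three tables are present.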

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/be487-fall-2024/assignments/16_function/hw16_function.ipynb $work_dir