# Removing Human Contamination

This notebook walks through the workflow for removing human contamination from microbiome sequencing reads.

-----------

Sections:

1. Remove all reads mapping to the human genome.

-----------


## Getting Started

Set the variables you need for running the analyses in this notebook.

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/07_contam_removal"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our trimmed fastq files.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/06_qc_trimming/trimmed_reads"
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in a script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export BOWTIE2=/contrib/singularity/shared/bhurwitz/bowtie2:2.5.1--py39h6fed5c7_2.sif" >> config.sh
!echo "export HUM_DB=/groups/bhurwitz/databases/chm13.draft_v1.0_plusY/chm13.draft_v1.0_plusY" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
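
As an optional sanity check, the config file can also be inspected from Python. The sketch below is a minimal, hypothetical helper (`parse_config` is not part of the class scripts). It assumes every line has the form `export NAME=value`, as written above, and flags any values that still hold the `MY_` placeholders.

```python
def parse_config(text):
    """Parse lines of the form `export NAME=value` into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip anything that is not an `export NAME=value` line
        if not line.startswith("export ") or "=" not in line:
            continue
        name, value = line[len("export "):].split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


# Example text standing in for the contents of config.sh.
# In the notebook you could use: open("config.sh").read()
sample = "export NETID=MY_NETID\nexport XFILE=MY_XFILE\n"
settings = parse_config(sample)

# Any value still starting with MY_ was never filled in
placeholders = [name for name, value in settings.items() if value.startswith("MY_")]
print("Unset placeholders:", placeholders)
```

If the list is not empty, go back to the first cell and set your netid and xfile before continuing.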

## Step 1: Mapping reads to the human genome

In this step, we will map all of our trimmed reads to the complete human genome using a tool called Bowtie2. 

It is important to note that this alignment process is imperfect, and many human reads in a microbiome sample can fail to align. When we look at the taxonomic composition of our samples with Kraken2 later in this class, we will assess how well the human removal worked by classifying the filtered reads against a database that contains viruses, microbes, and human sequences.

First, let's run Bowtie2 to remove the majority of human reads. We will write a run script that aligns all of our trimmed reads to the human genome and keeps only the read pairs that do not align.

In [None]:
# Create a script to run bowtie2 to align reads to a human reference
# A few important points:
# 1. We are using the variables from the config file via `source $SLURM_SUBMIT_DIR/config.sh`
# 2. bowtie2 runs on each of the fastq files in the trimmed $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 5G of memory per core);
#    we are also asking for more time (24 hours)
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                         
#SBATCH --output=07A_remove_human-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source $SLURM_SUBMIT_DIR/config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

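### reads with human removed (bowtie2 replaces % with 1 or 2, one file per mate)
BOWTIE_NAME="${WORK_DIR}/${SAMPLE_ID}_%.fastq.gz"

### alignments to the human genome (SAM output; deleted at the end of this script)
SAM_NAME="${WORK_DIR}/${SAMPLE_ID}_human_removed.sam"

### bowtie2 alignment summary (reports how many reads mapped to human)
MET_NAME="${WORK_DIR}/${SAMPLE_ID}_hostmap.log"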

apptainer run ${BOWTIE2} bowtie2 \
    -p 24 -x $HUM_DB -1 $PAIR1 -2 $PAIR2 --un-conc-gz $BOWTIE_NAME 1> $SAM_NAME 2> $MET_NAME

rm $SAM_NAME
'''

with open('07A_remove_human.sh', mode='w') as file:
    file.write(my_code)
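
The batch script picks one sample per array task with `names[$SLURM_ARRAY_TASK_ID]`, so `--array=0-7` assumes the xfile lists eight sample IDs. The sketch below mimics that lookup in Python to show which sample each task would receive. The function `assign_samples` and the sample IDs are hypothetical and only serve as an illustration.

```python
def assign_samples(xfile_text, first_task=0, last_task=7):
    """Map each SLURM array task ID to a sample ID, like names[$SLURM_ARRAY_TASK_ID]."""
    # bash word-splitting treats any whitespace as a separator between IDs
    names = xfile_text.split()
    return {
        task_id: (names[task_id] if task_id < len(names) else None)
        for task_id in range(first_task, last_task + 1)
    }


# Hypothetical xfile contents with three samples, checked against tasks 0-3
example_xfile = "sampleA\nsampleB\nsampleC\n"
assignment = assign_samples(example_xfile, first_task=0, last_task=3)

for task_id, sample_id in assignment.items():
    print(task_id, sample_id or "no sample (this task would get an empty SAMPLE_ID)")
```

If a task has no sample, the array range is larger than the number of IDs in your xfile and should be adjusted in the `#SBATCH --array` line.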

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the 07A_remove_human.sh script?
!pwd
!ls

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 07A_remove_human: first job - no dependencies
job1=$(sbatch 07A_remove_human.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

'''

with open('07_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)
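
The `sed` command in the launcher strips the text `Submitted batch job ` from the message that `sbatch` prints, leaving only the job ID. The same step can be sketched in Python as below. The helper name `parse_job_id` and the job number are made up for illustration.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return match.group(1) if match else None


# Hypothetical sbatch output
job_id = parse_job_id("Submitted batch job 1234567")
print(job_id)
```

Keeping the job ID is useful when a later job needs to wait for this one, or when you want to look up this specific job in the queue.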

In [None]:
# Make the pipeline script executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./07_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid
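
Once the jobs have left the queue, each sample should have two filtered fastq files and one log file in the working directory, following the names set in `07A_remove_human.sh`. The sketch below lists those expected paths and reports any that are missing. The helper names and the sample ID are hypothetical. Note that a file existing does not prove the job finished, since the log is created as soon as the job starts.

```python
from pathlib import Path


def expected_outputs(work_dir, sample_id):
    """Paths the run script should produce for one sample."""
    work = Path(work_dir)
    return [
        work / f"{sample_id}_1.fastq.gz",    # mate 1 with human reads removed
        work / f"{sample_id}_2.fastq.gz",    # mate 2 with human reads removed
        work / f"{sample_id}_hostmap.log",   # bowtie2 alignment summary
    ]


def missing_outputs(work_dir, sample_ids):
    """Return every expected path that does not exist yet."""
    return [
        path
        for sample_id in sample_ids
        for path in expected_outputs(work_dir, sample_id)
        if not path.exists()
    ]


# Hypothetical sample ID; in the notebook, pass work_dir and the IDs from your xfile
for path in expected_outputs("/path/to/work_dir", "sampleA"):
    print(path.name)
```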

### Time to wait...

Great job! You kicked off a script to remove *most* of the human reads from your fastq files. We will double check this when we run Kraken2 on the files to classify each of the reads by taxonomy. But, for now, we just need to wait for the jobs to finish running. Come back to this assignment in a few hours to run the hw07_check.ipynb notebook.
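
When a job finishes, the `_hostmap.log` file holds the alignment summary that Bowtie2 prints, which ends with a line of the form `NN.NN% overall alignment rate`. Because we aligned to the human genome, that rate is a rough estimate of how much of the sample was human. The sketch below pulls the number out of a log. The helper name and the example log text are illustrative, not real results.

```python
import re


def overall_alignment_rate(log_text):
    """Return the percent of reads bowtie2 aligned, or None if no summary line is found."""
    match = re.search(r"([\d.]+)% overall alignment rate", log_text)
    return float(match.group(1)) if match else None


# Made-up, shortened example of a bowtie2 summary.
# In the notebook you could read a real log with open(path).read()
example_log = """10000 reads; of these:
  10000 (100.00%) were paired; of these:
12.50% overall alignment rate
"""

rate = overall_alignment_rate(example_log)
print(f"{rate:.2f}% of reads aligned to the human reference")
```

A `None` result usually means the job has not finished yet or Bowtie2 stopped with an error, in which case the log itself is the first place to look.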

Before you go, another quick note: in the "real world" you may need to remove additional contamination using the same approach. For example, the sequencing center may have used PhiX as a "spike-in" to assess the quality of the sequencing run with a known quantity of DNA. Or, you may be working with a microbiome from a different "host". You can use the same approach as above to remove reads from any genome you think may be contaminating your sample.
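
To make the "same approach" concrete, the sketch below assembles the Bowtie2 command from this notebook with a different reference index swapped in. It uses only the options already shown above (`-p`, `-x`, `-1`, `-2`, `--un-conc-gz`). The function name, the index prefix, and the file names are placeholders, and the index for the other genome would have to be built or obtained separately.

```python
import shlex


def build_bowtie2_command(index_prefix, pair1, pair2, unaligned_pattern, threads=24):
    """Assemble the bowtie2 arguments used in this notebook for any reference index."""
    return [
        "bowtie2",
        "-p", str(threads),                  # number of threads
        "-x", index_prefix,                  # bowtie2 index of the contaminant genome
        "-1", pair1,                         # forward reads
        "-2", pair2,                         # reverse reads
        "--un-conc-gz", unaligned_pattern,   # read pairs that did NOT align
    ]


# All paths below are hypothetical placeholders
cmd = build_bowtie2_command(
    index_prefix="/path/to/phix_index/phix",
    pair1="sampleA_1.fastq.gz",
    pair2="sampleA_2.fastq.gz",
    unaligned_pattern="sampleA_nophix_%.fastq.gz",
)
print(shlex.join(cmd))
```

Feeding the human-filtered reads into a second pass like this one would remove both contaminants in sequence.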

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/be487-fall-2024/assignments/07_contam_removal/hw07_contam_removal.ipynb $work_dir