# Removing Human Contamination

This notebook walks through the workflow for removing human contamination from microbiome sequencing reads.

1. Human read removal by mapping to the human genome.


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/07_contam_removal"
%cd $work_dir

## Creating a config file
The script below executes code that requires certain variables to be set. To avoid editing the code in the script itself, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in the script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/07_contam_removal" >> config.sh
!echo "export XFILE_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming/trimmed_reads" >> config.sh
!echo "export HUM_DB=/groups/bhurwitz/databases/chm13.draft_v1.0_plusY/chm13.draft_v1.0_plusY" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
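
The same check can be done programmatically. The cell below is a minimal sketch, assuming `config.sh` contains only `export NAME=value` lines like the ones written above; `read_config` is a hypothetical helper name, not part of any library. `HUM_DB` is skipped because it is an index prefix rather than a directory.

```python
import os

def read_config(path):
    """Parse lines of the form `export NAME=value` into a dict."""
    config = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line.startswith("export "):
                continue  # skip blank lines and anything unexpected
            name, _, value = line[len("export "):].partition("=")
            config[name] = value
    return config

# Only runs when config.sh is present in the current directory
if os.path.exists("config.sh"):
    config = read_config("config.sh")
    for name in ("WORK_DIR", "XFILE_DIR", "FASTQ_DIR"):
        path = config.get(name, "")
        status = "ok" if os.path.isdir(path) else "MISSING"
        print(f"{name}: {path} [{status}]")
```

Any directory reported as `MISSING` usually points to a typo in the netid or to an earlier assignment that has not been completed.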

## Step 1: Mapping reads to the human genome

In this step, we will map all of our trimmed reads to the complete CHM13 human genome using Bowtie2. 

It is important to note that this alignment process is imperfect, and some human reads in a microbiome sample can fail to align. When we look at the taxonomic composition of our samples with Kraken2 later in this class, we will assess how well the human removal worked by classifying the filtered reads against a database that contains viruses, microbes, and human.

First, let's run Bowtie2 to remove the majority of human reads. We will write an sbatch script that aligns all of our trimmed reads to the human genome and removes those that align.

In [None]:
# Create a script to run bowtie2 to align reads to a human reference
# A few important points:
# 1. We are using the variables from the config file via the `source $SLURM_SUBMIT_DIR/config.sh` command in the script.
# 2. bowtie2 runs on each of the fastq files in the trimmed $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 5G of memory per core) and more time (24 hours).
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-rem_human-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source $SLURM_SUBMIT_DIR/config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

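### reads that did not align to human (% is replaced by 1 or 2 for each mate)
BOWTIE_NAME="${WORK_DIR}/${SAMPLE_ID}_%.fastq.gz"
### SAM alignments (deleted at the end of the script)
SAM_NAME="${WORK_DIR}/${SAMPLE_ID}_human_removed.sam"

### bowtie2 alignment summary (how many reads mapped to human)
MET_NAME="${WORK_DIR}/${SAMPLE_ID}_hostmap.log"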

apptainer run /contrib/singularity/shared/bhurwitz/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2 \
    -p 24 -x $HUM_DB -1 $PAIR1 -2 $PAIR2 --un-conc-gz $BOWTIE_NAME 1> $SAM_NAME 2> $MET_NAME

rm $SAM_NAME
'''

with open('remove_human_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Check the code and make sure your script above was created.
!cat remove_human_parallel.sh
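
The script's `--array=0-4` line assumes the xfile lists exactly five sample IDs, since each array task picks one ID by index. The cell below is a minimal sketch for checking that, assuming the xfile holds whitespace-separated sample IDs; `array_range` is a hypothetical helper name.

```python
def array_range(xfile_path):
    """Return the --array value that covers every sample ID in the xfile."""
    with open(xfile_path) as handle:
        # split() breaks on any whitespace, as the bash array assignment
        # names=($(cat ...)) in the script does
        sample_ids = handle.read().split()
    if not sample_ids:
        raise ValueError(f"no sample IDs found in {xfile_path}")
    return f"0-{len(sample_ids) - 1}"

# In the notebook, with netid and xfile set as in Getting Started:
# print(array_range(f"/xdisk/bhurwitz/bh_class/{netid}/assignments/05_getting_data/{xfile}"))
```

If the printed range differs from `0-4`, the `#SBATCH --array` line in the script should be changed to match before submitting.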

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the remove_human_parallel.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run bowtie on each of your trimmed fastq files to remove human
# Remember that this may take a while to run, so take a break, and get a coffee.
!sbatch ./remove_human_parallel.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or are running) you can check the progress
# and also look for errors in the *out files
# For example, you can look at Job-rem_human-0.out
!ls
!cat Job-rem_human-0.out
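
bowtie2 writes an alignment summary to standard error, which the script redirects to `${SAMPLE_ID}_hostmap.log`. The cell below is a minimal sketch for pulling the headline numbers out of those logs, assuming the usual summary layout that starts with `N reads; of these:` and ends with `P% overall alignment rate`; `parse_bowtie2_log` is a hypothetical helper name.

```python
import glob
import re

def parse_bowtie2_log(text):
    """Return (input_count, percent_aligned), or None if no summary is present."""
    total = re.search(r"^(\d+) reads; of these:", text, flags=re.MULTILINE)
    rate = re.search(r"([\d.]+)% overall alignment rate", text)
    if total is None or rate is None:
        return None
    return int(total.group(1)), float(rate.group(1))

# Summarize every log found in the current working directory
for log_path in sorted(glob.glob("*_hostmap.log")):
    with open(log_path) as handle:
        result = parse_bowtie2_log(handle.read())
    if result is None:
        print(f"{log_path}: no summary found (job still running or failed?)")
    else:
        input_count, percent_aligned = result
        print(f"{log_path}: {input_count} input records, {percent_aligned}% aligned to human")
```

The overall alignment rate gives a rough sense of how much of each sample mapped to the human reference. A log with no summary usually means the job has not finished or ended with an error.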

In [None]:
# Double check that all of your files have run through the human screening
# Do you see a *_1.fastq.gz and *_2.fastq.gz file in the working directory?
# If not, you will need to check your job out files above for clues about what went wrong.
!ls $work_dir

In [None]:
# Do the fastq files (post-human filter) look smaller than the ones that were trimmed?
# list the file sizes for fastq files post-human filtering
!ls -l $work_dir

In [None]:
# list the file sizes for fastq files before human filtering
!ls -l /xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming/trimmed_reads
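
Comparing the two listings by eye gets tedious with many samples. The cell below is a minimal sketch that pairs files by name, assuming the filtered and trimmed files share the same `*_1.fastq.gz` / `*_2.fastq.gz` names as in the script above; `compare_sizes` is a hypothetical helper name. Compressed file size is only a rough proxy for read count.

```python
import glob
import os

def compare_sizes(before_dir, after_dir, pattern="*.fastq.gz"):
    """Return (name, bytes_before, bytes_after) for files present in both directories."""
    rows = []
    for after_path in sorted(glob.glob(os.path.join(after_dir, pattern))):
        name = os.path.basename(after_path)
        before_path = os.path.join(before_dir, name)
        if os.path.exists(before_path):
            rows.append((name, os.path.getsize(before_path), os.path.getsize(after_path)))
    return rows

# In the notebook, with netid and work_dir set as in Getting Started:
# trimmed_dir = f"/xdisk/bhurwitz/bh_class/{netid}/assignments/06_qc_trimming/trimmed_reads"
# for name, before, after in compare_sizes(trimmed_dir, work_dir):
#     print(f"{name}: {before} -> {after} bytes ({after / before:.1%} of original)")
```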

Great job! It looks like you have removed *most* of the human reads from your fastq files. We will double check this when we run Kraken2 on the files to classify each of the reads by taxonomy.

Another quick note: you may need to remove additional contamination using the same approach. For example, the sequencing center may have used PhiX as a "spike-in" to assess the quality of the sequencing run with a known quantity of DNA. You can use the same approach as above to remove reads from any genome you think may be contaminating your sample.
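
To make the "same approach" concrete, the cell below is a minimal sketch that assembles the bowtie2 command for an arbitrary reference, using only the options that appear in the script above. The function name `bowtie2_filter_command`, the index prefix `/path/to/phix_index/phix`, and the sample file names are all hypothetical placeholders; a bowtie2 index for the contaminant would need to exist first.

```python
import shlex

def bowtie2_filter_command(index_prefix, pair1, pair2, out_pattern, threads=24):
    """Build the bowtie2 argument list for filtering read pairs against one reference."""
    return [
        "bowtie2",
        "-p", str(threads),             # number of threads
        "-x", index_prefix,             # bowtie2 index prefix for the contaminant
        "-1", pair1,                    # forward reads
        "-2", pair2,                    # reverse reads
        "--un-conc-gz", out_pattern,    # pairs that did not align concordantly
    ]

# Hypothetical paths, for illustration only
cmd = bowtie2_filter_command(
    "/path/to/phix_index/phix",
    "sample_1.fastq.gz",
    "sample_2.fastq.gz",
    "sample_nophix_%.fastq.gz",
)
print(shlex.join(cmd))
```

As in the script above, the SAM alignments go to standard output and the summary to standard error, so both would still need to be redirected when the command is placed in an sbatch script.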

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/07_contam_removal.ipynb $work_dir