# Treponema genome analysis workflow - Grillova et al. 2019

## The environment

### Tested software versions and operation system

The workflows has been tested on Linux machine (Ubuntu 16.04) with Python 2.7.6, Python 3.4.3 using Conda 4.5.11, Jupyter notebook 5.7.2 and bash_kernel 0.7.1.

In [1]:
uname -a
lsb_release -a
python --version
python3 --version
pip3 --version
conda --version
echo "Jupyter notebook" `jupyter notebook --version`
echo "bash_kernel" `pip3 show bash_kernel | grep Version`

Linux BioDA-server 4.4.0-148-generic #174-Ubuntu SMP Tue May 7 12:20:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Python 2.7.12
Python 3.7.1
The program 'pip3' is currently not installed. To run 'pip3' please ask your administrator to install the package 'python3-pip'
conda 4.6.14
Jupyter notebook 4.4.1
The program 'pip3' is currently not installed. To run 'pip3' please ask your administrator to install the package 'python3-pip'
bash_kernel


## The analysis

First, we can activate the environment and check 

In [2]:
source activate treponema
# In case we want to export the system settings and software versions
conda info
conda list
# In case we modified the environment want to export it
#conda env export > treponema_mod.yml

(.env) (.env) 
     active environment : None
       user config file : /mnt/nfs/home/325073/000000-My_Documents/VM-home/.condarc
 populated config files : /mnt/nfs/home/325073/000000-My_Documents/VM-home/.condarc
          conda version : 4.6.14
    conda-build version : 3.17.6
         python version : 3.7.1.final.0
       base environment : /mnt/ssd/ssd_2/install/dir/anaconda  (writable)
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://conda.anacon

: 1

## Running the workflow

### Input variables

First, we have to setup few variables which will be used throughout the analysis.

In [25]:
INPUT_DIR="/home/jan/Projects/treponema/data/raw"
OUTPUT_DIR="/home/jan/Projects/treponema/results"

THREADS=6

ADAPTER_R1="CTGTCTCTTATACACATCT"
ADAPTER_R2=$ADAPTER_R1

HOST_GENOME="/home/jan/Projects/treponema/data/references/GCF_000001405.36_GRCh38.p10_genomic.fna.gz" # ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_genomic.fna.gz
STRAINSEEKER_DB="/home/jan/Projects/treponema/data/references/strainseeker/ss_db_w32_4324"

REFERENCE="/home/jan/Projects/treponema/data/references/SS14.fa" # Bacteria reference genome

(treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) (treponema) 

: 1

### Initial quality check

It is always a good idea to run an initial quality check on your data.

In [8]:
mkdir -p $OUTPUT_DIR/qc/fastqc/raw

fastqc --threads $THREADS --outdir $OUTPUT_DIR/qc/fastqc/raw $INPUT_DIR/*.gz

multiqc --outdir $OUTPUT_DIR/qc/fastqc/raw $OUTPUT_DIR/qc/fastqc/raw/ 

(treponema) (treponema) Started analysis of CW56_S1_R1_001.fastq.gz
Started analysis of CW56_S1_R2_001.fastq.gz
Approx 5% complete for CW56_S1_R1_001.fastq.gz
Approx 5% complete for CW56_S1_R2_001.fastq.gz
Approx 10% complete for CW56_S1_R1_001.fastq.gz
Approx 10% complete for CW56_S1_R2_001.fastq.gz
Approx 15% complete for CW56_S1_R1_001.fastq.gz
Approx 15% complete for CW56_S1_R2_001.fastq.gz
Approx 20% complete for CW56_S1_R1_001.fastq.gz
Approx 20% complete for CW56_S1_R2_001.fastq.gz
Approx 25% complete for CW56_S1_R1_001.fastq.gz
Approx 25% complete for CW56_S1_R2_001.fastq.gz
Approx 30% complete for CW56_S1_R1_001.fastq.gz
Approx 30% complete for CW56_S1_R2_001.fastq.gz
Approx 35% complete for CW56_S1_R2_001.fastq.gz
Approx 35% complete for CW56_S1_R1_001.fastq.gz
Approx 40% complete for CW56_S1_R2_001.fastq.gz
Approx 40% complete for CW56_S1_R1_001.fastq.gz
Approx 45% complete for CW56_S1_R2_001.fastq.gz
Approx 45% complete for CW56_S1_R1_001.fastq.gz
Approx 50% complete for CW

: 1

### Read preprocessing

Since we are working with DNA data and aiming for the results including the polymorphisms we should perform a careful preprocessing to remove the adapter sequences and perform quality trimming.

In [18]:
mkdir -p $OUTPUT_DIR/data/preprocessed
mkdir $OUTPUT_DIR/qc/cutadapt
mkdir $OUTPUT_DIR/qc/fastqc/preprocessed

cd $INPUT_DIR/

for sample in *R1*.gz
do
    FORWARD=$sample
    extension="${FORWARD##*R1}"
    REVERSE=${FORWARD%R1*}R2${extension}

    # Input file check
    if [ ! -f ${FORWARD} ]; then
    echo "Input file not found! First in pair."; echo ${FORWARD}
    fi
    if [ ! -f ${REVERSE} ]; then
    echo "Input file not found! Second in pair."; echo ${REVERSE}
    fi

    echo "Now I am processing ${FORWARD} as first in a pair and ${REVERSE} as a second in a pair."

    cutadapt -a $ADAPTER_R1 -A $ADAPTER_R2 \
    --times 1 --quality-cutoff 15,15 --trim-n \
    --error-rate 0.10 -O 3 --minimum-length 35 --max-n 0 \
    --output $OUTPUT_DIR/data/preprocessed/${FORWARD%$extension}.trimmed.fastq.gz \
    --paired-output $OUTPUT_DIR/data/preprocessed/${REVERSE%$extension}.trimmed.fastq.gz \
    $FORWARD $REVERSE &>$OUTPUT_DIR/qc/cutadapt/${FORWARD%_R1${extension}}.cutadapt.out

    echo "Done processing $FORWARD and $REVERSE"
done

multiqc --outdir $OUTPUT_DIR/qc/cutadapt $OUTPUT_DIR/qc/cutadapt/

fastqc --threads $THREADS --outdir $OUTPUT_DIR/qc/fastqc/preprocessed $OUTPUT_DIR/data/preprocessed/*.gz

multiqc --outdir $OUTPUT_DIR/qc/fastqc/preprocessed $OUTPUT_DIR/qc/fastqc/preprocessed/

(treponema) mkdir: cannot create directory ‘/home/jan/Projects/treponema/results/qc/cutadapt’: File exists
(treponema) mkdir: cannot create directory ‘/home/jan/Projects/treponema/results/qc/fastqc/preprocessed’: File exists
(treponema) (treponema) (treponema) (treponema) Now I am processing CW56_S1_R1_001.fastq.gz as first in a pair and CW56_S1_R2_001.fastq.gz as a second in a pair.
Done processing CW56_S1_R1_001.fastq.gz and CW56_S1_R2_001.fastq.gz
(treponema) (treponema) Started analysis of CW56_S1_R1.trimmed.fastq.gz
Started analysis of CW56_S1_R2.trimmed.fastq.gz
Approx 5% complete for CW56_S1_R1.trimmed.fastq.gz
Approx 5% complete for CW56_S1_R2.trimmed.fastq.gz
Approx 10% complete for CW56_S1_R1.trimmed.fastq.gz
Approx 10% complete for CW56_S1_R2.trimmed.fastq.gz
Approx 15% complete for CW56_S1_R1.trimmed.fastq.gz
Approx 15% complete for CW56_S1_R2.trimmed.fastq.gz
Approx 20% complete for CW56_S1_R1.trimmed.fastq.gz
Approx 20% complete for CW56_S1_R2.trimmed.fastq.gz
Approx 25% 

: 1

### Host genome contamination removal

Removal of the host genome DNA before the analysis speeds up the analysis and we will be working with much smaller files as well. 

In [24]:
# Generate a host genome DNA reference index - please run just once
mkdir $(dirname $HOST_GENOME)/bbmap_index

# If the host genome index does not exist create it
if [ ! -d "$(dirname $HOST_GENOME)/bbmap_index/$(basename $HOST_GENOME)" ]; then
    mkdir $(dirname $HOST_GENOME)/bbmap_index/$(basename $HOST_GENOME)
    bbmap.sh ref=$HOST_GENOME -Xmx20g path=$(dirname $HOST_GENOME)/bbmap_index/$(basename $HOST_GENOME)
fi

# Run the host genome reference removal
mkdir $OUTPUT_DIR/qc/bbmap

cd $OUTPUT_DIR/data/preprocessed/

for sample in *R1*trimmed.fastq.gz
do
    FORWARD=$sample
    extension="${FORWARD##*R1}"
    REVERSE=${FORWARD%R1*}R2${extension}

    # Input file check
    if [ ! -f ${FORWARD} ]; then
        echo "Input file not found! First in pair."; echo ${FORWARD}
    fi
    if [ ! -f ${REVERSE} ]; then
        echo "Input file not found! Second in pair."; echo ${REVERSE}
    fi

    echo "Now I am processing ${FORWARD} as first in a pair and ${REVERSE} as a second in a pair with reference $HOST_GENOME."

    # Start mapping
    bbmap.sh threads=$THREADS -Xmx25g minid=0.95 maxindel=3 bandwidthratio=0.16 \
    bandwidth=12 quickmatch fast minhits=2 path=$(dirname $HOST_GENOME)/bbmap_index/$(basename $HOST_GENOME) unpigz pigz \
    in=${FORWARD} in2=${REVERSE} outu=$OUTPUT_DIR/data/preprocessed/${FORWARD%_R1${extension}}.clean.fastq.gz \
    outm=$OUTPUT_DIR/data/preprocessed/${FORWARD%_R1${extension}}.dirty.fastq.gz &>$OUTPUT_DIR/qc/bbmap/${FORWARD%_R1${extension}}.bbmap.out # qtrim=rl trimq=10 untrim  # We already have preprocessed data, no need for this

    # De-interleave
    reformat.sh -Xmx12g verifypaired in=$OUTPUT_DIR/data/preprocessed/${FORWARD%_R1${extension}}.clean.fastq.gz \
    out1=$OUTPUT_DIR/data/preprocessed/${FORWARD%${extension}}.clean.fastq.gz out2=$OUTPUT_DIR/data/preprocessed/${REVERSE%${extension}}.clean.fastq.gz

    # Remove the host genome mapped reads (usefull for mapping precision)
    rm $OUTPUT_DIR/data/preprocessed/${FORWARD%_R1${extension}}.clean.fastq.gz
    rm $OUTPUT_DIR/data/preprocessed/${FORWARD%_R1${extension}}.dirty.fastq.gz
done

(treponema) mkdir: cannot create directory ‘/home/jan/Projects/treponema/data/references/bbmap_index’: File exists
(treponema) (treponema) (treponema) java -Djava.library.path=/home/jan/Tools/anaconda/envs/treponema/opt/bbmap-38.22-0/jni/ -ea -Xmx2g -cp /home/jan/Tools/anaconda/envs/treponema/opt/bbmap-38.22-0/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/home/jan/Projects/treponema/data/references/GCF_000001405.36_GRCh38.p10_genomic.fna.gz -Xmx2g path=./bbmap_index/HOST_GENOME
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/home/jan/Projects/treponema/data/references/GCF_000001405.36_GRCh38.p10_genomic.fna.gz, -Xmx2g, path=./bbmap_index/HOST_GENOME]
Version 38.22

No output file.
NOTE:	Deleting contents of ./bbmap_index/HOST_GENOME/ref/genome/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [/home/jan/Projects/treponema/data/references/GCF_000001405.36_GRCh38.p10_genomic.fna.gz, 1, wri

: 1

### Bacteria contamination scan (optional)

Before we start with the alignment we can quickly scan for possible bacterial contamination in our dataset. This scan uses default StrainSeeker database which is most likely outdated but StrainSeeker offers a possibility to generate your [own index](http://bioinfo.ut.ee/strainseeker/index.php?r=site/page&view=manual#database) for the scan with their builder script. One advanatage over tools such as [https://ccb.jhu.edu/software/kraken2/](Kraken2) (a great tool) is that is consumes much less RAM. However, the latest releases of MiniKraken2 could be used as well.  

In [None]:
mkdir $OUTPUT_DIR/qc/seeker

cd $OUTPUT_DIR/data/preprocessed/

for sample in *.clean.fastq.gz
do 
    echo "Working on sample $sample"

    echo "Subsampling"
    seqtk sample -s100 $sample 1000000 > $sample.sub # Subsample fastq

    echo "Scanning"
    perl seeker.pl -i $sample.sub -d $STRAINSEEKER_DB -o $OUTPUT_DIR/data/preprocessed/${sample%.fastq.gz}.seeker.txt

    rm $sample.sub
done

### Bacteria reference genome alignment

With host genome DNA cleaned data we can proceed to the alignment to the reference.

In [26]:
# Prepare reference indexes
bwa index $REFERENCE
samtools faidx $REFERENCE

mkdir -p $OUTPUT_DIR/qc/alignment_stats


for sample in *R1*clean.fastq.gz
    FORWARD=$sample
    extension="${FORWARD##*R1}"
    REVERSE=${FORWARD%R1*}R2${extension}

    cd $SCRATCH/

    # Input file check
    if [ ! -f ${FORWARD} ]; then
        echo "Input file not found! First in pair."; echo ${FORWARD}
    fi
    if [ ! -f ${REVERSE} ]; then
        echo "Input file not found! Second in pair."; echo ${REVERSE}
    fi

    # Start mapping
    echo "Now I am processing ${FORWARD} as first in a pair and ${REVERSE} as a second in a pair with reference $REF_DIR/$REF_SEQ"

    $BWA mem -t $THREADS -T $BWAMEMT -v 1 -M -R "@RG\tID:1\tLB:${sample%%.*}\tPL:Illumina\tSM:${sample%%.*}\tPU:${sample%%.*}" $SCRATCH/$REF_SEQ ${FORWARD} ${REVERSE} | $SAMTOOLS view -F 4 -@ $THREADS -b - | $SAMTOOLS sort -@ $THREADS - > $SCRATCH/alignment/${FORWARD%$APPENDIX1*}.${REF_SEQ%.*}.bam # Keep only aligned reads and fix mate information; should be try -k 23? This might decrease number of mismapped reads but who knows what it does to highly variable regions

    echo "Mapping finished"
    #rm $FORWARD $REVERSE

    cd $SCRATCH/alignment

    i=${FORWARD%$APPENDIX1*}.${REF_SEQ%.*}.bam

    $SAMTOOLS index -@ $THREADS $i # Index BAM files
    $SAMTOOLS flagstat $i > $SCRATCH/alignment/stats/${i%.*}.flagstat &
    $SAMTOOLS view -@ $THREADS -h -F 12 -f 2 -F 256 -b $i | $SAMTOOLS sort -n -@ $THREADS - | $SAMTOOLS fixmate -O bam - - | $SAMTOOLS sort -@ $THREADS - > ${i%.*}.filt.bam # -F 2048 = supplementary alignment, "chimeric/non-linear alignments"; -q $MAPQ
    $SAMTOOLS index -@ $THREADS ${i%.*}.filt.bam
    $SAMTOOLS flagstat ${i%.*}.filt.bam > $SCRATCH/alignment/stats/${i%.*}.filt.flagstat

    # Postprocessing
    # Create input seqence dictionary and index reference
    rm $SCRATCH/${REF_SEQ%.*}.dict
    $PICARD_RUN CreateSequenceDictionary R=$SCRATCH/$REF_SEQ O=$SCRATCH/${REF_SEQ%.*}.dict

    # Realign
    i=${i%.*}.filt.bam

    $GATK_RUN -T RealignerTargetCreator --num_threads $THREADS -R $SCRATCH/$REF_SEQ -I ${i} -o $SCRATCH/alignment/${i%.*}.forIndelRealigner.intervals # Prepare intervals
    $GATK_RUN -I ${i} -R $SCRATCH/$REF_SEQ -T IndelRealigner -LOD 2.5 --consensusDeterminationModel USE_SW -targetIntervals $SCRATCH/alignment/${i%.*}.forIndelRealigner.intervals -o $SCRATCH/alignment/${i%.*}.indelRealigned.bam # Run re-alignment

    $SAMTOOLS index -@ $THREADS ${i%.*}.indelRealigned.bam # Index BAM files

    rm $i
    rm $i*
    rm $SCRATCH/alignment/${i%.*}.indelRealigned.bai
    rm $SCRATCH/alignment/${i%.*}.forIndelRealigner.intervals

    # Remove duplicates
    i=${i%.*}.indelRealigned.bam

    mkdir $SCRATCH/picard_dup
    $PICARD_RUN MarkDuplicates INPUT=$i OUTPUT=${i%.*}.dedup.bam METRICS_FILE=$SCRATCH/picard_dup/${i%.*}.dedupStats.txt REMOVE_DUPLICATES=true OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 # Taggs ALL duplicates, PCR and optical and remove them

    $SAMTOOLS index -@ $THREADS ${i%.*}.dedup.bam
    $SAMTOOLS flagstat ${i%.*}.dedup.bam > $SCRATCH/alignment/stats/${i%.*}.dedup.flagstat

    rm $i
    rm $i*
done

[bwa_index] Pack FASTA... 0.01 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.18 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.09 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index /home/jan/Projects/treponema/data/references/SS14.fa
[main] Real time: 0.886 sec; CPU: 0.292 sec
(treponema) (treponema) 

: 1

### Post-alignment filtering

### Reference genome coverage

### Consensus genome generation

### BAM downsampling (optional)

### Variant call

### Variant annotation

### BAM to fastq

### *De novo* assembly - mapped reads

### *De novo* assembly - all reads (optional)

### Additional scaffolding

### Assembly Quality Check

### Outputs