# RNA-seq Processing Pipeline

Jennifer Stiens
j.j.stiens@gmail.com
Birkbeck, University of London

## Date:  11-05-23

### Notebook for download, QC and mapping of RNA-seq files

The details of the RNA-seq processing and mapping performed for the WGCNA paper are found in the github repo for the paper:

[WGCNA rna processing doc](https://github.com/jenjane118/mtb_wgcna/blob/master/mtb_wgcna_doc.Rmd)

#### Each step in the pipeline can be run in one of two ways: with Snakemake and the associated snakefiles, or with command-line scripts

For help installing Snakemake:

[snakemake installation conda/mamba](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)

In [None]:

#install snakemake (install mamba first or install mamba inside conda)

conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake
mamba activate snakemake
mamba install -c bioconda bwa samtools fastqc multiqc fastp rseqc sra-tools deeptools
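
Before starting, it can help to confirm that the tools installed above are actually visible in the active environment. The following is a minimal sketch; the function name `missing_tools` is hypothetical, and the executable names in the usage comment are assumptions based on the packages listed above (for example, sra-tools provides `fasterq-dump` and deeptools provides `bamCoverage`).

```shell
# Print the name of every tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example call (run inside the snakemake environment):
# missing_tools snakemake bwa samtools fastqc multiqc fastp fasterq-dump bamCoverage
```

An empty output means everything was found; any name printed needs installing or a `module load` first.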

My snakemake files are found at https://github.com/jenjane118/rna_seq_snakemake. Copy these into your own snakemake folder.

Directory structure of the snakemake scripts:

```
├── README.md
├── bam_coverage
│   └── snakefile.smk
├── bowtie2
│   └── snakefile.smk
├── dir_tree.txt
├── fastp
│   ├── pe
│   │   └── snakefile.smk
│   └── single
│       └── snakefile.smk
├── fastqc
│   └── pe
│       └── snakefile.smk
├── map_bwa
│   ├── pe
│   │   └── snakefile.smk
│   └── single
│       └── snakefile.smk
├── mbovis_wgs.ipynb
├── rna_seq_nb.ipynb
├── sra
│   ├── pe
│   │   └── snakefile.smk
│   └── single
│       └── snakefile.smk
└── tree_out.txt

14 directories, 14 files
```



## Download files from SRA to Birkbeck server

This step uses SRA Tools, which is installed on thoth as a module (in /s/software/modules).

You may want to create a directory for your fastq files, named 'ncbi' or after the project. Run the snakefile or shell script below from inside this directory.

In [None]:
module load ncbi-sra/v2.10.5 #(in /s/software/modules)
cd ncbi/<dataset_name>
#make shell script to iterate through accession numbers (iterate_fasterq.sh)
#!/bin/bash

while IFS= read -r line;
do
	echo "accession number: 	$line"
	#call fasterq to download from sra
	fasterq-dump ${line} -O files/
	echo -e "########################\n\n"
done < "$1"


# to run program in background:
nohup bash iterate_fasterq.sh accession_list.txt &> fasterq_dump.out &
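
Because the download runs in the background, it is easy to miss an accession that failed. The sketch below lists accessions from the accession file that have no fastq output yet. It assumes fasterq-dump's usual output names (`<accession>.fastq` for single-end, `<accession>_1.fastq` and `<accession>_2.fastq` for paired-end, optionally compressed afterwards); the function name `missing_accessions` is hypothetical.

```shell
# List accessions from the accession file that have no fastq output yet.
# usage: missing_accessions accession_list.txt files/
missing_accessions() {
    acc_list=$1
    out_dir=$2
    while IFS= read -r acc || [ -n "$acc" ]; do
        # skip blank lines
        if [ -z "$acc" ]; then
            continue
        fi
        found=0
        # match SRRxxx.fastq* and SRRxxx_1.fastq*, but not SRRxxx0.fastq
        for f in "$out_dir/$acc".fastq* "$out_dir/$acc"_*.fastq*; do
            if [ -e "$f" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "$acc"
        fi
    done < "$acc_list"
}
```

For example, `missing_accessions accession_list.txt files/ > retry_list.txt` would give a list that can be passed back to `iterate_fasterq.sh`.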

In [None]:
# if using snakemake

# choose the single or pe snakefile depending on whether the data are single- or paired-end

# make a directory for the dataset and move to that directory
mkdir -p $my_path/mtb_rna/PRJNA838962
cd $my_path/mtb_rna/PRJNA838962

#make config.yaml file in directory including something like the following line to indicate accessions:
#accession: [SRR21026195,SRR21026196,SRR21026197,SRR21026198,SRR21026199,SRR21026200]

conda activate snakemake
module load ncbi-sra/v2.10.5

#dry run
snakemake -np -s $my_path/snakemake/sra/pe/snakefile.smk
#run in background
nohup snakemake --cores 8 -s $my_path/snakemake/sra/pe/snakefile.smk > nohup.out 2>&1 &
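
The `accession:` line for config.yaml can be generated from an accession list rather than typed by hand. This is a minimal sketch, assuming a text file with one accession per line (such as the accession_list.txt used by the shell script); the function name `make_accession_config` is hypothetical and not part of the snakefiles.

```shell
# Print an "accession: [A,B,C]" line from a file with one accession per line.
# Blank lines are ignored.
make_accession_config() {
    joined=$(grep -v '^[[:space:]]*$' "$1" | paste -s -d, -)
    printf 'accession: [%s]\n' "$joined"
}

# Example: append the line to the config file in the dataset directory
# make_accession_config accession_list.txt >> config.yaml
```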

Once you have confirmed that fastq files have been downloaded for all the desired accession numbers, perform some sanity checks: look for discrepancies in the number of reads between paired-end files, and check the appearance and read length of the records. The files will be in compressed form and there is no need to decompress them at this time. (The line count is included in the snakemake script.)


In [None]:
#Sanity checks

#1) Check appearance and read length (files are compressed, so decompress to stdout first)

zcat <file.fastq.gz> | head -50

#2) Count lines (4 lines per read): R1 and R2 should match
zcat <file.fastq.gz> | wc -l

# or loop through and count lines, printing the file name next to each count:
for file in *.fastq.gz; do echo "$file: $(zcat "$file" | wc -l)"; done

#or (for uncompressed files)
find . -name '*.fastq' -exec wc -l {} +
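
The checks above can be wrapped into small functions that report reads rather than lines and flag mismatched pairs automatically. This is a sketch under the assumption that the files are standard four-line fastq records compressed with gzip; the function names (`count_reads`, `read_lengths`, `check_pair`) are hypothetical.

```shell
# Number of reads in a gzip-compressed fastq file (4 lines per read).
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Tabulate read lengths: prints "<length> <number of reads>", sorted by length.
read_lengths() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (l in n) print l, n[l] }' |
        sort -n
}

# Compare the read counts of a pair of files.
# Prints both counts and returns non-zero if they differ.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads; $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For example, `check_pair <sample>_1.fastq.gz <sample>_2.fastq.gz || echo "MISMATCH"` flags a pair whose R1 and R2 counts differ.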

Choose which quality control program(s) to use. FastQC and fastp are equally useful for QC, but fastp also trims the reads at the same time, so I prefer it.

To run FastQC on a directory of fastq files, create the following bash script and run it:


In [None]:

#!/bin/bash

# iterate_fastqc.sh
# usage: bash iterate_fastqc.sh

FILES=*.fastq

for file in $FILES
do
	filename=$(basename "$file")
	filename="${filename%.*}"

	echo "File on the loop: 	$filename"

	#call fastQC quality analysis
	/s/software/fastqc/v0.11.8/FastQC/fastqc ${file}

	echo -e "########################\n\n"
done

# Move FastQC output into a new folder
mkdir -p ./fast_QC_outputs
mv *fastqc.zip ./fast_QC_outputs
mv *fastqc.html ./fast_QC_outputs

# Run MultiQC to compile outputs
# -f overwrites existing reports, . scans the files in the current directory
echo "Running MultiQC..."
cd fast_QC_outputs
multiqc -f .


In [None]:
module load python/v3
module load fastqc
bash iterate_fastqc.sh

In [None]:
# with snakemake
cd $my_path/<dataset_dir>
conda activate snakemake
#dry run (use appropriate single/pe file depending on data)
snakemake -np -s $my_path/snakemake/fastqc/pe/snakefile.smk
#run in background
nohup snakemake --cores 8 -s $my_path/snakemake/fastqc/pe/snakefile.smk > nohup.out 2>&1 &

In the WGCNA paper, we then used Trimmomatic to trim the adapters. I don't think the choice of trimming tool matters as long as the mapping statistics are good.

In [None]:

#The following is a script to run trimmomatic on single end samples:
        #!/bin/bash
        # iterate_trimmomatic.sh
        # Runs Trimmomatic in SE mode for all .fastq.gz files in the directory given as argument
        # Run as:
        # nohup bash $my_path/scripts/iterate_trimmomatic.sh PRJNA488546


        timestamp=`date "+%Y%m%d-%H%M%S"`
        logfile="run_$timestamp.log"
        exec > $logfile 2>&1  #all output will be logged to logfile

        TRIM_EXEC="/s/software/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar"
        DIR=$1
        shift

        echo "Running Trimmomatic using executable: $TRIM_EXEC"

        for file in `ls $DIR/*.fastq.gz` ;
        do
          echo "File on Loop: ${file}"
          sample=${file/$DIR\/}
          sample=${sample/.fastq.gz/}
          echo "Sample= $sample"
          
          java -jar $TRIM_EXEC SE -threads 12 -phred33 \
               -trimlog "$sample"_trim_report.txt \
               "$DIR/$sample".fastq.gz "$sample"_trimmed.fastq.gz \
               ILLUMINACLIP:/s/software/trimmomatic/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 \
               LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

          gzip "$sample"_trim_report.txt
        done
        


In [None]:
        
module load trimmomatic
cd $my_path/ncbi/files/<dataset_dir>
nohup bash $my_path/scripts/iterate_trimmomatic.sh PRJNA488546 >& iterate_trim.out &

Lately I have switched to fastp to trim and do quality control. It trims adapters and automatically detects the adapter sequences by default. If there are remaining adapters, you can specify adapter sequences.

[fastp docs](https://github.com/OpenGene/fastp)

In [None]:
# paired end, gzip compressed, R1 is read1, R2 is read2 of paired end
mkdir trimmed_reads
#fastp is not yet on the thoth server (need to ask Dave to install it); it can also be used in a conda env. The code below is representative of how to use it.
module load fastp
fastp -i <sample_name>.R1.fastq.gz -I <sample_name>.R2.fastq.gz -o trimmed_reads/<sample_name>_trimmed.R1.fastq.gz -O trimmed_reads/<sample_name>_trimmed.R2.fastq.gz
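
To apply the fastp command above to a whole directory without snakemake, the read pairs need to be matched up by sample name. The sketch below only prints one fastp command per pair so the commands can be inspected before anything is run. It assumes files are named `<sample>.R1.fastq.gz` and `<sample>.R2.fastq.gz` (rename or adjust the pattern if your downloads end in `_1`/`_2`); the function name `fastp_commands` is hypothetical, and only the `-i`, `-I`, `-o` and `-O` options shown above are used.

```shell
# Print one fastp command per read pair found in a directory.
# usage: fastp_commands <input_dir> <output_dir>
fastp_commands() {
    in_dir=$1
    out_dir=$2
    for r1 in "$in_dir"/*.R1.fastq.gz; do
        # skip the unexpanded pattern when nothing matches
        if [ ! -e "$r1" ]; then
            continue
        fi
        sample=$(basename "$r1" .R1.fastq.gz)
        r2="$in_dir/$sample.R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "no R2 file for $sample, skipping" >&2
            continue
        fi
        echo "fastp -i $r1 -I $r2 -o $out_dir/${sample}_trimmed.R1.fastq.gz -O $out_dir/${sample}_trimmed.R2.fastq.gz"
    done
}
```

A possible way to use it is `fastp_commands . trimmed_reads > run_fastp.sh`, check the file, and then `nohup bash run_fastp.sh &> fastp.out &`.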

In [None]:
#with snakemake

cd $my_path/<dataset_dir>
conda activate snakemake
#dry run (use appropriate single/pe file depending on data)
snakemake -np -s $my_path/snakemake/fastp/pe/snakefile.smk
#run in background
nohup snakemake --cores 8 -s $my_path/snakemake/fastp/pe/snakefile.smk > nohup.out 2>&1 &

Map the trimmed reads with BWA-mem

In [None]:
module load bwa
module load samtools

#create index file for genome in same directory as genome file
bwa index AL123456_3.fasta

# using shell script from Yen-Yi (check that the filenames and paths are correct before running)
#!/bin/bash

# Runs bwa in paired-end  mode, sorts and indexes files 

# Run as:
# nohup sh BWA_PE.sh directory_of_fastq_files samples

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

dir=$1
shift

#set location of executables
BWA_EXEC=<PATH TO EXEC>
SAMTOOLS_EXEC=<PATH TO EXEC>

#set parameters
genomeFile=<GENOME_FILE> #index files should be in same directory
numProc=8

#suffixes for the paired fastq files, without .gz (the script adds .gz when it looks for compressed files)
suffix1="_trimmed.R1.fastq"
suffix2="_trimmed.R2.fastq"
EXT=fastq.gz

#loop over the R1 files only so that each pair is mapped once
for sample in "$dir"*.R1.${EXT};
do
  sample=$(basename "$sample" | cut -f 1 -d '_')
  echo "Running bwa on sample $sample (paired-end mode)..."

  pairedFile1="$dir$sample$suffix1".gz
  if [ -f $pairedFile1 ]
    then
      gzip -d $pairedFile1
      pairedFile1=$dir$sample$suffix1
  else
    pairedFile1=$dir$sample$suffix1
    if [ ! -f $pairedFile1 ]
      then
        echo "File not found: $pairedFile1"
        exit 1
    fi
  fi
  pairedFile2="$dir$sample$suffix2".gz
  if [ -f $pairedFile2 ]
    then
      gzip -d $pairedFile2
      pairedFile2=$dir$sample$suffix2
  else
    pairedFile2=$dir$sample$suffix2
    if [ ! -f $pairedFile2 ]
      then
        echo "File not found: $pairedFile2"
        exit 1
    fi
  fi

  tmpSam="$sample"_pe.sam
  tmpBam="$sample"_pe.bam
  finalSortedBam="$sample"_sorted.bam
   
  #align 
  $BWA_EXEC mem -t $numProc $genomeFile $pairedFile1 $pairedFile2 > $tmpSam

  #create bam file
  $SAMTOOLS_EXEC view $tmpSam -Sbo $tmpBam
  $SAMTOOLS_EXEC sort $tmpBam -o $finalSortedBam
  $SAMTOOLS_EXEC index $finalSortedBam

  #cleanup
  /bin/rm $tmpSam $tmpBam
  gzip -9 $pairedFile1 $pairedFile2
done
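
The mapping script assumes the genome has already been indexed with `bwa index` in the same directory as the genome file. A quick check before starting a long run can save time. This sketch tests for the five index files that `bwa index` normally writes next to the fasta (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); the function name `bwa_index_ready` is hypothetical.

```shell
# Return 0 if all bwa index files exist (and are non-empty) for a genome fasta.
# usage: bwa_index_ready AL123456_3.fasta
bwa_index_ready() {
    genome=$1
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$genome.$ext" ]; then
            echo "missing index file: $genome.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Example: only index when needed
# bwa_index_ready AL123456_3.fasta || bwa index AL123456_3.fasta
```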

In [None]:
#Mapping output quality check script

module load samtools

#requires a bed file for genome

#!/bin/bash
timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

dir=$1
shift

EXT=bam
ref_genome="<genome_file/ref_genomic.bed>"
SUFFIX="_sorted.bam"
for sample in  *.${EXT};
    do
        sample=$(echo $sample | cut -f 1 -d '_')
        echo "Running mapping quality scripts on sample $sample..."
        echo "sample is $sample"
        quality_check=$dir$sample$SUFFIX
        samtools flagstat $quality_check > "flagstat_$sample.txt"
        echo "Mapping output quality check for $sample done..."
    done

mkdir -p flagstat_output
mv flagstat_*.txt flagstat_output
multiqc ./
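
MultiQC gives a full report, but for a quick look at mapping rates the flagstat files can be summarised on the command line. This is a sketch that assumes the usual samtools flagstat text layout, where the overall mapped line looks like `990 + 0 mapped (99.00% : N/A)`, and the `flagstat_<sample>.txt` naming used in the script above; the function name `flagstat_summary` is hypothetical.

```shell
# Print "<sample> <mapped reads> <percent mapped>" (tab-separated)
# for each flagstat file given as argument.
flagstat_summary() {
    for f in "$@"; do
        sample=$(basename "$f" .txt)
        sample=${sample#flagstat_}
        # field 4 is "mapped" only on the overall mapped line,
        # not on the "primary mapped" line written by newer samtools
        awk -v s="$sample" '$4 == "mapped" {
            pct = $5
            sub(/^\(/, "", pct)
            printf "%s\t%s\t%s\n", s, $1, pct
        }' "$f"
    done
}

# Example:
# flagstat_summary flagstat_output/flagstat_*.txt
```

Samples with an unexpectedly low percentage are worth checking for adapter contamination or a wrong reference genome before going further.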

In [None]:
# with snakemake (maps, sorts, indexes and creates flagstats report)
cd $my_path/<dataset_dir>
conda activate snakemake
snakemake -np -s $my_path/snakemake/map_bwa/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/map_bwa/pe/snakefile.smk > nohup_map.out 2>&1 &

## It is useful to have bam coverage files to use with IGV

In [None]:
module load python/v3
#makes a separate bigwig file for forward and reverse strands
bamCoverage -b sorted_reads/{sample}.bam -o covg_bigwigs/{sample}_fwd.bw -of bigwig --filterRNAstrand forward -p 8 --binSize 1 --extendReads
bamCoverage -b sorted_reads/{sample}.bam -o covg_bigwigs/{sample}_rev.bw -of bigwig --filterRNAstrand reverse -p 8 --binSize 1 --extendReads
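
In the two commands above, `{sample}` is a placeholder that has to be filled in for each BAM file. Without snakemake, a loop can do this. The sketch below only prints the forward and reverse bamCoverage commands for every BAM file in a directory, reusing exactly the options shown above; the function name `bamcoverage_commands` is hypothetical.

```shell
# Print forward- and reverse-strand bamCoverage commands for every BAM file.
# usage: bamcoverage_commands <bam_dir> <output_dir>
bamcoverage_commands() {
    bam_dir=$1
    out_dir=$2
    for bam in "$bam_dir"/*.bam; do
        # skip the unexpanded pattern when nothing matches
        if [ ! -e "$bam" ]; then
            continue
        fi
        sample=$(basename "$bam" .bam)
        for strand in forward reverse; do
            case $strand in
                forward) tag=fwd ;;
                reverse) tag=rev ;;
            esac
            echo "bamCoverage -b $bam -o $out_dir/${sample}_$tag.bw -of bigwig --filterRNAstrand $strand -p 8 --binSize 1 --extendReads"
        done
    done
}
```

For example, `mkdir -p covg_bigwigs; bamcoverage_commands sorted_reads covg_bigwigs > run_covg.sh` writes the commands to a file that can be checked and then run with bash.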

In [None]:
#with snakemake
cd <dataset_dir>
snakemake -np -s $my_path/snakemake/bam_coverage/snakefile.smk
snakemake --cores 3 -s $my_path/snakemake/bam_coverage/snakefile.smk