# RNA-seq Processing Notebook

Jennifer Stiens
j.j.stiens@gmail.com
Birkbeck, University of London

## Date started:  06-10-22

### Notebook for processing Mbovis RNA seq data



In [3]:
!docker run --rm -it avsastry/get-all-rnaseq:latest "Mycobacterium tuberculosis variant bovis AF2122/97" > Mbovis_rna.tsv

docker: Error response from daemon: dial unix /Users/jenniferstiens/Library/Containers/com.docker.docker/Data/docker.raw.sock: connect: connection refused.
See 'docker run --help'.


This won't run inside the conda env but does work outside it: open the Docker app (in Applications), select the avsastry container, then run the command from a bash shell.

However, it doesn't accept the full organism name; it rejects the 'variant bovis' part.

Instead, got the accessions by searching SRA directly. There are 3 from 2022 and 6 from 2018. Use the one-step method.

In [None]:
!cd ncbi
!module load ncbi-sra/v2.10.5 #(in /s/software/modules)

#make shell script to iterate through accession numbers (iterate_fasterq.sh)
#!/bin/bash

while IFS= read -r line;
do
	echo "accession number: 	$line"
	#call fasterq-dump to download from SRA
	fasterq-dump "${line}" -O files/
	echo -e "########################\n\n"
done < "$1"



!nohup bash iterate_fasterq.sh accession_list.txt &> fasterq_dump.out &
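
For reference, the same loop can be sketched in Python. This is only an illustration, assuming the accession list has one accession per line; the function name `fasterq_commands` is made up for this example, and the commands are built and printed rather than executed.

```python
import shlex

def fasterq_commands(accessions, outdir="files/"):
    """Build one fasterq-dump command per non-empty accession line."""
    commands = []
    for line in accessions:
        accession = line.strip()
        if not accession:
            continue  # skip blank lines in the accession list
        commands.append(["fasterq-dump", accession, "-O", outdir])
    return commands

# Lines as they would be read from accession_list.txt
example = ["SRR18961402\n", "SRR18961411\n", "\n", "SRR18961412\n"]
for cmd in fasterq_commands(example):
    print(shlex.join(cmd))
```

To actually download, each command list could be passed to `subprocess.run(cmd, check=True)` on a machine where `fasterq-dump` is on the path.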

Report from fasterq_dump.out

accession number: 	SRR18961402
spots read      : 63,186,291
reads read      : 126,372,582
reads written   : 126,372,582
########################


accession number: 	SRR18961411
spots read      : 74,133,084
reads read      : 148,266,168
reads written   : 148,266,168
########################


accession number: 	SRR18961412
spots read      : 73,101,699
reads read      : 146,203,398
reads written   : 146,203,398
########################
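
A quick way to check the report is to parse it and confirm that, for paired-end runs, 'reads read' is twice 'spots read' and nothing was lost on writing. This is a minimal sketch; `parse_report` is a hypothetical helper, and the text is pasted from the report above.

```python
report = """\
accession number: SRR18961402
spots read      : 63,186,291
reads read      : 126,372,582
reads written   : 126,372,582
accession number: SRR18961411
spots read      : 74,133,084
reads read      : 148,266,168
reads written   : 148,266,168
accession number: SRR18961412
spots read      : 73,101,699
reads read      : 146,203,398
reads written   : 146,203,398
"""

def parse_report(text):
    """Return {accession: {field: count}} from fasterq-dump summary lines."""
    runs, current = {}, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # separator or blank line
        key, value = key.strip(), value.strip()
        if key == "accession number":
            current = value
            runs[current] = {}
        elif current is not None:
            runs[current][key] = int(value.replace(",", ""))
    return runs

runs = parse_report(report)
for accession, stats in runs.items():
    paired = stats["reads read"] == 2 * stats["spots read"]
    complete = stats["reads written"] == stats["reads read"]
    print(accession, "paired:", paired, "complete:", complete)
```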


In [None]:
*QC on fastq files*

Sanity checks to make sure reads are intact.


In [None]:

#1) Check read length

!head -50 <file.fastq>

#2) Count number of reads (line count / 4): R1 and R2 should match
!wc -l <file.fastq>
# loop through and count lines (one shell call, so the loop runs as a unit):
!for file in *.fastq; do wc -l "$file"; done
#or
!find . -name '*.fastq' -exec wc -l {} +

Read lengths seem to vary between 120 and 151.
The line count for each fastq file is exactly double the 'reads read' and 'reads written' values reported by fasterq-dump: each file holds half of the reads (one mate per pair) at four lines per read. R1 and R2 match for all fastq files.
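
The arithmetic can be sketched as follows, assuming standard four-line FASTQ records (which is what fasterq-dump writes by default). `count_fastq_reads` is an illustrative helper that accepts any iterable of lines, so it works on an open file handle as well as on the small in-memory example here.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines (4 lines per record)."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; file may be truncated")
    return n_lines // 4

# Tiny in-memory example with two records
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
]
print(count_fastq_reads(example))

# Cross-check against the fasterq-dump report for SRR18961402
spots_read = 63_186_291          # read pairs, so reads per file
reads_read = 126_372_582         # both mates together
lines_per_file = 4 * spots_read  # what wc -l should report for _1 and for _2
print(lines_per_file == 2 * reads_read)
```

For a real file, something like `with open("SRR18961402_1.fastq") as handle: count_fastq_reads(handle)` should give the 'spots read' value.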

Iterate through fastq files and do fastqc and multiqc on the dataset.

In [None]:

!module load python/v3

#!/bin/bash

# iterate_fastqc.sh
# usage: bash iterate_fastqc.sh

for file in *.fastq
do
	filename=$(basename "$file")
	filename="${filename%.*}"

	echo "File on the loop: 	$filename"

	#call FastQC quality analysis
	/s/software/fastqc/v0.11.8/FastQC/fastqc "${file}"

	echo -e "########################\n\n"
done

# Move FastQC outputs into a new folder
mkdir -p ./fast_QC_outputs
mv *fastqc.zip ./fast_QC_outputs
mv *fastqc.html ./fast_QC_outputs

# Run MultiQC to compile the outputs
# -f overwrites existing reports, . searches the current directory (-o would set an output directory)
echo "Running MultiQC..."
cd fast_QC_outputs
multiqc -f .

| Sample | percent_duplicates | percent_gc | avg_sequence_length | percent_fails | total_sequences |
|---|---|---|---|---|---|
| SRR18961402_1 | 95.74886897667076 | 62.0 | 143.76444751283154 | 40.0 | 63186291.0 |
| SRR18961402_2 | 94.71897274594298 | 62.0 | 144.53112903873406 | 40.0 | 63186291.0 |
| SRR18961411_1 | 96.0286096462345 | 62.0 | 143.98847565818252 | 40.0 | 74133084.0 |
| SRR18961411_2 | 95.16330953555591 | 61.0 | 144.60963914572878 | 40.0 | 74133084.0 |
| SRR18961412_1 | 95.86770216996337 | 62.0 | 143.48920810992368 | 40.0 | 73101699.0 |
| SRR18961412_2 | 95.55483256230265 | 62.0 | 144.26212813466893 | 40.0 | 73101699.0 |

I'm always puzzled by the high level of duplicates. Why do I always seem to see this?
Found this on Biostars (https://www.biostars.org/p/14283/):
"Observing high rates of read duplicates in RNA-seq libraries is common. It may not be an indication of poor library complexity caused by low sample input or over-amplification. It might be caused by such problems but it is often because of very high abundance of a small number of genes (usually ribosomal or mitochondrial house keeping genes). For example, I have seen libraries where ~60% of all reads mapped to the 2-10 most highly expressed genes. Sometimes 75% of all reads map to the top 0.1% of expressed genes. The result of such heavy sampling of these genes is a high number of duplicate reads (even when considering read pairs in assessing duplicates)."

[Multiqc report](file:///Users/jenniferstiens/myco_projects/rna_seq_processing/qc/multiqc/multiqc_report.html)

Might want to gzip the fastq files, as they are very big, and move them to a directory named after the project.

In [None]:
!gzip *.fastq
!mv *.fastq.gz PRJNA832959/ 


Trimming step:
adapter files in ~/git/tn_seq/data/adapters_all.fa

I didn't run this--maybe run fastp instead (see rna_processing.smk)

In [None]:
##DON'T RUN


#For trimmomatic, the module should be loaded with:
module load trimmomatic

#The following is a script to run trimmomatic on single-end samples:
#!/bin/bash
# iterate_trimmomatic.sh
# Runs Trimmomatic in SE mode on every .fastq.gz file in the directory given as the first argument
# Run as:
# nohup bash $my_path/scripts/iterate_trimmomatic.sh PRJNA488546

timestamp=`date "+%Y%m%d-%H%M%S"`
logfile="run_$timestamp.log"
exec > $logfile 2>&1  #all output will be logged to logfile

TRIM_EXEC="/s/software/trimmomatic/Trimmomatic-0.38/trimmomatic-0.38.jar"
DIR=$1
shift

echo "Running Trimmomatic using executable: $TRIM_EXEC"

for file in "$DIR"/*.fastq.gz
do
  echo "File on Loop: ${file}"
  sample=${file/$DIR\/}
  sample=${sample/.fastq.gz/}
  echo "Sample= $sample"

  java -jar $TRIM_EXEC SE -threads 12 -phred33 \
       -trimlog "$sample"_trim_report.txt \
       "$DIR/$sample".fastq.gz "$sample"_trimmed.fastq.gz \
       ILLUMINACLIP:/s/software/trimmomatic/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 \
       LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

  gzip "$sample"_trim_report.txt
done
        
        
module load trimmomatic
cd $my_path/ncbi/files/PRJNA488546
nohup bash $my_path/scripts/iterate_trimmomatic.sh PRJNA488546 >& iterate_trim.out &


*Mapping reads to genome using BWA-mem*

In [None]:
#DON'T RUN

module load bwa
#create index files
bwa index AL123456_3.fasta

module load samtools

# bash script (adapted from yen yi)

#!/bin/bash

# Run as:
# nohup bash BWA_single.sh directory_of_fastq_files


10 October

Starting snakemake pipeline for processing RNA

First set up environment for rna including snakemake

In [None]:
#install mamba in conda
conda install -n base -c conda-forge mamba
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

conda activate snakemake
conda install -c bioconda bwa samtools fastqc multiqc fastp rseqc sra-tools

Made snakefile 'rna_processing.smk'
AttributeError: 'str' object has no attribute 'name'
I'm still getting this error, the same one I got when using snakemake with the tutorial. Updated conda and python: the env was using python 3.1, but I updated to 3.9.13.

In [None]:
#removed snakemake env on laptop and reinstalled as above
conda env remove --name snakemake
# updated python in snakemake env (updates to 3.9.13) (this seems to do trick)
conda update python       
#from base
conda update -n base conda
#still not showing update for ==> WARNING: A newer version of conda exists. <==
#   current version: 4.12.0
#   latest version: 22.9.0

# Please update conda by running

#     $ conda update -n base conda

# try this
conda install -n base conda=22.9.0 python=3.9
#this got stuck, tried without python, also stuck, try with python=3.8?

#supposedly because using conda-forge before defaults? can try this or update ~/.condarc file to put defaults in priority
conda install -n base defaults::conda=22.9

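Looked online, and the Anaconda release notes made it look as though 4.12 is the latest version. That page describes the Anaconda Distribution rather than conda itself, though; conda switched to calendar-based version numbers in 2022, so 22.9.0 is a genuine newer release rather than a mistake.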
https://docs.anaconda.com/anaconda/reference/release-notes/

In [None]:


#checking how these will run
snakemake -s rna_processing.smk -np trimmed/trimmed_SRR18961402_1.fastq.gz
snakemake -s rna_processing.smk -np fastqc_outputs/multiqc_report.html
snakemake --dag trimmed/trimmed_SRR18961402_{1,2} | dot -Tsvg > dag.svg

Error trying to visualise the DAG: the shell doesn't recognise the 'dot' command, which comes from graphviz. Install graphviz (maybe this is because conda is still not updated; if so, repeat the update steps above).

In [None]:
conda install -c anaconda graphviz

Run fastp alone

In [None]:
#dry run
snakemake -np -s rna_processing.smk
#forced execution of fastp only
snakemake --cores 8 -s rna_processing.smk -R fastp_pe

This worked great for one file on my laptop. Uses a lot of resources, so do all of them on thoth.

Test mapping_snakefile.smk on laptop
First must have index file for genome file in directory

In [None]:
bwa index Mbovis_AF212297.fasta

In [None]:
snakemake -np -s mapping_snakefile.smk
#only using 2 cores on my machine, since I only have 4!
snakemake --cores 2 -s mapping_snakefile.smk

#can run in background with nohup (stderr and stdout to nohup.out--or whatever you name it)
nohup snakemake ... > nohup.out 2>&1 &

#create an svg image of the pipeline DAG
snakemake --dag sorted_reads/flagstat_SRR18961402.txt | dot -Tsvg > dag.svg

fasterq-dump not working; it doesn't seem to be included in the conda sra-tools package.
Try module load ncbi-sra/v2.10.5 and see if that works?

Works on thoth but can't be used on the laptop.

In [None]:
11 October map the fastq files using snakemake files on thoth

In [None]:
conda activate snakemake
#run from inside project folder 
cd $my_path/mbovis_rna/PRJNA832959/

#dry run
snakemake -np -s $my_path/snakemake/mapping_snakefile.smk
#config.yaml either needs its path given or needs to be in the working folder; copied it in (moved snakemake folders as well)
cp $my_path/snakemake/config.yaml .

snakemake -np -s mapping_snakefile.smk

Had to delete env and make new snakemake installation and env on thoth to get to work.

Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
bwa_map_pe               3              1              1
samtools_flagstat        3              1              1
samtools_sort            3              1              1
total                   10              1              1

In [None]:

#run for real
snakemake --cores 8 -s mapping_snakefile.smk

Error in rule bwa_map_pe:
    jobid: 3
    output: mapped_reads/SRR18961402.bam
    log: logs/bwa_mem/SRR18961402.log (check log file(s) for error message)
    shell:
        (bwa mem -t 8 /d/in16/u/sj003/refseqs/mbovis/Mbovis_AF2122_97.fasta /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961402_1.fastq.gz /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961402_2.fastq.gz /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961411_1.fastq.gz /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961411_2.fastq.gz /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961412_1.fastq.gz /d/in16/u/sj003/mbovis_rna/PRJNA832959/fastq/SRR18961412_2.fastq.gz | samtools view -Sb - > mapped_reads/SRR18961402.bam) 2> logs/bwa_mem/SRR18961402.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[main_samview] fail to read the header from "-".

Realised I was inputting all the fastq files at once by using 'expand' in the rule's input, so every bwa mem job received all six files. Have adjusted the rule.
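
The difference can be illustrated in plain Python. This is a sketch only: `expand_pattern` is a hypothetical stand-in that mimics how Snakemake's `expand` fills a pattern with every combination of values, and the `fastq/` paths are shortened versions of the ones in the error above.

```python
from itertools import product

def expand_pattern(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    keys = list(wildcards)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(wildcards[k] for k in keys))
    ]

samples = ["SRR18961402", "SRR18961411", "SRR18961412"]
pattern = "fastq/{sample}_{read}.fastq.gz"

# What the failing rule effectively did: all samples go to a single bwa mem call
all_inputs = expand_pattern(pattern, sample=samples, read=[1, 2])
print(len(all_inputs), "files passed to one job")

# What a per-sample job should receive: only the two mates of that sample
def inputs_for(sample):
    return expand_pattern(pattern, sample=[sample], read=[1, 2])

print(inputs_for("SRR18961402"))
```

In a Snakefile this usually means keeping `{sample}` as a wildcard in the mapping rule's input and reserving `expand` over all samples for the final target rule.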

In [None]:
# run fastqc, multiqc, and fastp for pre-processing using snakefile (probably should have used nohup)
snakemake -np -s rna_processing.smk
snakemake --cores 4 -s rna_processing.smk #next time use one core per read file (6 in this case)

#getting an error because some files aren't ready for multiqc. need to increase the latency wait, or easier to just run multiqc independently.
#had already run fastqc and multiqc so just omitted this step



Mapping to H37Rv

In [None]:
#run mapping on trimmed and filtered files from fastp
nohup snakemake --cores 8 -s mapping_snakefile.smk > nohup.out 2>&1 &

#realised it didn't create index files--added .bam.bai output files to 'rule_all'
snakemake --cores 3 -s mapping_snakefile.smk --forcerun samtools_index

New snakefile to make bigwig coverage files from bam files. Maybe add to mapping file?

In [None]:
cd /Volumes/Data_disk/mbovis_rna
#make sure config.yaml is in directory

snakemake -np -s ~/myco_projects/rna_seq_processing/bam_coverage.smk

snakemake --cores 3 -s ~/myco_projects/rna_seq_processing/bam_coverage.smk

In general, it may be better to have many small snakefiles/workflows and import them into one larger 'Snakefile'; then I can just call 'snakemake --cores n' without listing the script, since snakemake picks up a file with that name by default.

https://snakemake.readthedocs.io/en/v7.15.2/snakefiles/modularization.html

In [None]:
#SNAKEFILE for rna download, proc

Working on wrapper for Baerhunter

Using tutorial at https://github.com/dohlee/snakemake-hamburger

Basic snakemake tutorial at https://github.com/leipzig/SandwichesWithSnakemake suggests installing rpy2 if you are going to use R in a script

In [None]:
conda activate snakemake
pip install rpy2

I have decided not to try to implement Baerhunter (bh) in snakemake; it will be difficult to install all the R dependencies in this env. Perhaps later, when I can use docker or singularity, I can have linked workflows with one env.

20 October, 2022

Running 'sra_download.smk' on thoth to download another mbovis dataset. finding an error

Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait

otherwise, can execute in batches

In [None]:
conda activate snakemake
module load ncbi-sra/v2.10.5
cd ncbi
snakemake --latency-wait 60 --cores 8 -s $my_path/snakemake/sra_download.smk 
#wait 60 seconds
nohup snakemake --latency-wait 60 --cores 8 -s $my_path/snakemake/sra_download.smk > nohup.out 2>&1 &
#maybe wasn't because of latency--might have been because had some single end files, so no read 2 forthcoming?
#still got error re latency
nohup snakemake --latency-wait 60 --cores 8 -s $my_path/snakemake/sra_download_single.smk > nohup.out 2>&1 &
#same error downloads one fastq and exits with error try batch?

Realised the directory for output and input files in the snakefile was 'data/*.fastq', but fasterq-dump writes into 'files'. So snakemake can't find the files when listing them.

In [None]:
nohup snakemake --cores 8 -s $my_path/snakemake/sra_download_single.smk > nohup.out 2>&1 &

This worked, finally! 

Run fastqc on all files in fastq

Installed 'tree' so I can visualise the directories for snakemake files (especially if I make wrappers?). Reorganised the directory so all snakefiles are called 'snakefile.smk' but sit in directories specific to each rule.

```
(snakemake) bash-4.2$ tree
.
├── bam_coverage
│   └── snakefile.smk
├── config1.yaml
├── config.yaml
├── dag.svg
├── fastp
│   ├── pe
│   └── single
├── fastqc
│   └── snakefile.smk
├── map_bwa
│   ├── pe
│   │   └── snakefile.smk
│   └── single
└── sra
    ├── pe
    │   └── snakefile.smk
    └── single
        └── snakefile.smk
11 directories, 8 files
```

this way I don't have to figure out name of each script (pe vs single etc)

In [None]:
cd mbovis_rna
cd PRJNA774648
conda activate snakemake
snakemake -np -s $my_path/snakemake/fastp/single/snakefile.smk

In [None]:
#config.yaml
PATH: "/d/in16/u/sj003/"
single_samples: [SRR16574977, SRR16574976, SRR16574973]
pe_samples: [SRR16574969, SRR16574968, SRR16574967]

#can't mix = and : in one config file, all has to be a 
#also, has to be explicit directory, can't use aliases
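
One way to avoid the alias problem is to resolve shorthand into an explicit path before writing it into config.yaml. The following is a small sketch for that purpose; `explicit_path` and `is_explicit` are hypothetical helper names, and the `my_path` value is set here only for the example.

```python
import os

def explicit_path(path):
    """Expand ~ and $VARS, then return an absolute path."""
    return os.path.abspath(os.path.expanduser(os.path.expandvars(path)))

def is_explicit(path):
    """True if the path is absolute and contains no ~ or $ shorthand."""
    return os.path.isabs(path) and "~" not in path and "$" not in path

# Assumption for the example: my_path points at the home directory on thoth
os.environ["my_path"] = "/d/in16/u/sj003"

print(is_explicit("$my_path/refseqs"))    # shorthand, not safe for the config
print(explicit_path("$my_path/refseqs"))  # resolved form to paste into config.yaml
print(is_explicit("/d/in16/u/sj003/"))
```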

In [None]:
nohup snakemake --cores 8 -s $my_path/snakemake/fastp/single/snakefile.smk > nohup.out 2>&1 &

this worked well. now on to mapping

In [None]:
cd $my_path/mbovis_rna/PRJNA774648
conda activate snakemake
snakemake -np -s $my_path/snakemake/map_bwa/single/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/map_bwa/single/snakefile.smk > nohup.out 2>&1 &

28 October downloading some msmeg fastqs to map to see if phoP/R is present

paired end data from PRJNA838962 wild type samples only

In [None]:
mkdir -p msmeg_rna/PRJNA838962
cat msmeg_rna/PRJNA838962/config.yaml


project: PRJNA527616

accession: [SRR19242553, SRR19242554, SRR19242555]  #paired-end
genomeFile: /d/in16/u/sj003/refseqs/msmeg/NC_008596.1.fasta
samples: [SRR19242553, SRR19242554, SRR19242555]
PATH: "/d/in16/u/sj003/"

conda activate snakemake
cd ncbi
module load ncbi-sra/v2.10.5    
snakemake -np -s $my_path/snakemake/sra/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/sra/pe/snakefile.smk > nohup.out 2>&1 &

run fastp with paired end

In [None]:
cd msmeg_rna
cd PRJNA838962/
conda activate snakemake
snakemake -np -s $my_path/snakemake/fastp/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/fastp/pe/snakefile.smk > nohup.out 2>&1 &


map with bwa-mem

In [None]:
mv trimmed/*.fastq.gz trimmed/pe/
cd $my_path/msmeg_rna/PRJNA838962
conda activate snakemake
snakemake -np -s $my_path/snakemake/map_bwa/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/map_bwa/pe/snakefile.smk > nohup_map.out 2>&1 &

3 November
These were using unstranded protocol. Using new project (PRJNA820116) and repeating.

2 March

## Downloading and processing RNA seq from sigmaE deletion mutant and WT from PRJNA869087

Want to check levels of phoPR and antisense-phoR in the sigE mutant. All paired end.

In [None]:
# begin by downloading fastq files from sra (on thoth server)

mkdir $my_path/mtb_rna/PRJNA838962
cd $my_path/mtb_rna/PRJNA838962

#make config.yaml file in directory including following line to indicate accessions:
#accession: [SRR21026195,SRR21026196,SRR21026197,SRR21026198,SRR21026199,SRR21026200]

conda activate snakemake
module load ncbi-sra/v2.10.5

snakemake -np -s $my_path/snakemake/sra/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/sra/pe/snakefile.smk > nohup.out 2>&1 &

All files successfully downloaded. The sanity check doesn't work because snakemake creates the sc.txt file before any .gz files have been generated; need to change that. I downloaded into the project directory instead of the ncbi directory, and changed the fastp snakefile to reflect this.
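
A sanity check of this kind only makes sense once the compressed files exist. As a rough sketch, the check could first list which expected files are missing; `missing_outputs` is a hypothetical helper and the `<accession>_<read>.fastq.gz` naming is assumed from the paired-end file names used earlier in this notebook.

```python
from pathlib import Path

def missing_outputs(directory, accessions, reads=(1, 2)):
    """Return the expected .fastq.gz paths that do not exist yet."""
    expected = [
        Path(directory) / f"{accession}_{read}.fastq.gz"
        for accession in accessions
        for read in reads
    ]
    return [path for path in expected if not path.exists()]

accessions = ["SRR21026195", "SRR21026196"]
# "fastq" is an assumed directory name for this example
missing = missing_outputs("fastq", accessions)
if missing:
    print("not ready, still waiting for:", *missing, sep="\n  ")
else:
    print("all files present, safe to write sc.txt")
```

Within Snakemake itself, the usual approach is to list the .gz files as the input of the rule that writes sc.txt, so that rule cannot run before they exist.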

Use fastp to trim and give quality report

In [None]:
cd $my_path/mtb_rna/PRJNA838962
conda activate snakemake
snakemake -np -s $my_path/snakemake/fastp/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/fastp/pe/snakefile.smk > nohup.out 2>&1 &

Somehow managed to miss out the last accession and substitute an old one. Need to download this one and run fastp on it.

In [None]:
#change config to reflect single accession and sample and re-run
snakemake -np -s $my_path/snakemake/fastp/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/fastp/pe/snakefile.smk > nohup.out 2>&1 &

multiqc .

In [45]:
import json

multiqc_json = '/Users/jenniferstiens/myco_projects/rna_seq_processing/data/sigE_rna_seq_runs/multiqc_data/multiqc_data.json'
with open(multiqc_json, 'r') as f:
    multiqc_data = json.load(f)

# view data in 'pretty print' format
#print(json.dumps(multiqc_data, indent=4))

# get keys for samples
samples = list(multiqc_data['report_data_sources']['fastp']['all_sections'].keys())

# create new dictionary for sample : read counts (reads passing the fastp filter)
reads_count = {}
for x in samples:
    gen_stats = multiqc_data['report_general_stats_data'][0][x]
    reads_count[x] = gen_stats['filtering_result_passed_filter_reads']

for k, v in reads_count.items():
    print(k, v)

SRR21026199_fastp 69952114.0
SRR21026195_fastp 81551958.0
SRR21026197_fastp 71654586.0
SRR21026196_fastp 89879690.0
SRR21026198_fastp 61663800.0
SRR21026200_fastp 60050326.0


Quality looks good but there is 32-37% duplication (always seems to be high; deep sequencing?). Over 60M reads passed the filter for each sample.

Map with bwa-mem, using whichever H37Rv genome file you prefer. I used AL123456.3.
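
The read-depth statement can be checked directly from the counts printed above. This is a minimal sketch with the values copied by hand from that output; the 60M threshold is simply the figure quoted in the note.

```python
# Reads passing the fastp filter, copied from the MultiQC output above
reads_count = {
    "SRR21026199_fastp": 69952114.0,
    "SRR21026195_fastp": 81551958.0,
    "SRR21026197_fastp": 71654586.0,
    "SRR21026196_fastp": 89879690.0,
    "SRR21026198_fastp": 61663800.0,
    "SRR21026200_fastp": 60050326.0,
}

threshold = 60_000_000
below = {sample: n for sample, n in reads_count.items() if n <= threshold}

lowest = min(reads_count, key=reads_count.get)
print("lowest:", lowest, int(reads_count[lowest]))
print("samples at or below threshold:", below)
```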

In [None]:
snakemake -np -s $my_path/snakemake/map_bwa/pe/snakefile.smk
nohup snakemake --cores 8 -s $my_path/snakemake/map_bwa/pe/snakefile.smk > nohup_map.out 2>&1 &