# QC of raw data

The first step is to check the overall quality of the sequenced reads. A poor RNA-seq run is characterized by one or more of the following types of uninformative sequences:
- PCR duplicates
- adapter contamination
- rRNA and tRNA reads
- unmappable reads, e.g. from contaminating nucleic acids

All but the last category of possible problems can be detected using FastQC; unmappable reads only become apparent after alignment.

Note that the dataset used to illustrate this pipeline is synthetic.

In [1]:
path <- "/mnt/data/GWES/RNAseq/input/ASTRO_DUMMY"
outpath <- "/mnt/data/GWES/RNAseq/output/ASTRO_DUMMY/fastQC"

# create the output directory (including any missing parent directories)
dir.create(outpath, recursive = TRUE)
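
The rest of this notebook shells out to the `fastqc` and `multiqc` command-line tools, so it can help to confirm that both are installed before going further. The following is a minimal sketch using base R's `Sys.which()`; the variable names `tools` and `missing_tools` are illustrative and not part of the original pipeline.

```r
# Sys.which() returns "" for any program that cannot be found on the PATH
tools <- Sys.which(c("fastqc", "multiqc"))
missing_tools <- names(tools)[tools == ""]

# warn (rather than stop) so the notebook can still be browsed without the tools
if (length(missing_tools) > 0) {
    warning("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```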

In [2]:
list.files(path=paste(path,"FASTQ",sep="/"))

In [3]:
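# run FastQC on every file in the FASTQ directory; --extract unzips each report
fastq_dir <- file.path(path, "FASTQ")
for (fastq in list.files(fastq_dir)) {
    cmd <- sprintf("fastqc %s --extract -o %s", shQuote(file.path(fastq_dir, fastq)), shQuote(outpath))
    cat(system(cmd, intern = TRUE), sep = "\n")
}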

Analysis complete for E22C6astro1_S27_L001_R1_001.fastq.gz
Analysis complete for E22C6astro1_S27_L002_R1_001.fastq.gz
Analysis complete for E22C6astro1_S27_L003_R1_001.fastq.gz
Analysis complete for E22C6astro1_S27_L004_R1_001.fastq.gz
Analysis complete for E22C6astro2_S28_L001_R1_001.fastq.gz
Analysis complete for E22C6astro2_S28_L002_R1_001.fastq.gz
Analysis complete for E22C6astro2_S28_L003_R1_001.fastq.gz
Analysis complete for E22C6astro2_S28_L004_R1_001.fastq.gz
Analysis complete for E22C6astro3_S29_L001_R1_001.fastq.gz
Analysis complete for E22C6astro3_S29_L002_R1_001.fastq.gz
Analysis complete for E22C6astro3_S29_L003_R1_001.fastq.gz
Analysis complete for E22C6astro3_S29_L004_R1_001.fastq.gz
Analysis complete for E33C2astro1_S33_L001_R1_001.fastq.gz
Analysis complete for E33C2astro1_S33_L002_R1_001.fastq.gz
Analysis complete for E33C2astro1_S33_L003_R1_001.fastq.gz
Analysis complete for E33C2astro1_S33_L004_R1_001.fastq.gz
Analysis complete for E33C2astro2_S10_L001_R1_001.fastq.
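
With many input files it is easy to miss one that did not finish, so a quick check that each FASTQ file has an extracted report directory can be useful. The sketch below assumes FastQC's naming as seen later in this notebook (`<name>.fastq.gz` gives `<name>_fastqc/`); the helper `missing_fastqc_reports` is hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: return the FASTQ files that have no extracted
# FastQC directory (<stem>_fastqc) inside outdir.
missing_fastqc_reports <- function(fastq_files, outdir) {
    # drop a trailing .fastq/.fq, optionally followed by .gz, to get the stem
    stems <- sub("\\.(fastq|fq)(\\.gz)?$", "", basename(fastq_files))
    report_dirs <- file.path(outdir, paste0(stems, "_fastqc"))
    fastq_files[!dir.exists(report_dirs)]
}

# Usage with the variables defined earlier in this notebook:
# missing_fastqc_reports(list.files(file.path(path, "FASTQ")), outpath)
```

An empty result means every input file has a matching report directory.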

**You can inspect each HTML report individually, or use a tool such as MultiQC to combine the results into a single HTML report.**

In [9]:
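# run MultiQC on the FastQC output; the UTF-8 locale exports avoid locale errors that its Python command-line interface may otherwise raise
cmd <- sprintf("export LC_ALL=C.UTF-8 && export LANG=C.UTF-8 && multiqc %s --dirs --interactive -o %s", outpath, paste(outpath, "multiqc", sep = "/"))
cat(system(cmd, intern = TRUE), sep = "\n")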




In [10]:
# to view the report, either copy it into the current Jupyter working directory (the cp command below) or open it directly on your host machine
cmd <- sprintf("cp %s/multiqc/multiqc_report.html %s", outpath, getwd())
cat(system(cmd, intern = TRUE))

In [11]:
# to quickly inspect a single sample, look at its summary.txt file
cmd <- sprintf("cat %s/EKOC3astro3_S38_L004_R1_001_fastqc/summary.txt", outpath)
cat(system(cmd, intern = TRUE), sep = "\n")

PASS	Basic Statistics	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Per base sequence quality	EKOC3astro3_S38_L004_R1_001.fastq.gz
WARN	Per tile sequence quality	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Per sequence quality scores	EKOC3astro3_S38_L004_R1_001.fastq.gz
FAIL	Per base sequence content	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Per sequence GC content	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Per base N content	EKOC3astro3_S38_L004_R1_001.fastq.gz
WARN	Sequence Length Distribution	EKOC3astro3_S38_L004_R1_001.fastq.gz
WARN	Sequence Duplication Levels	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Overrepresented sequences	EKOC3astro3_S38_L004_R1_001.fastq.gz
PASS	Adapter Content	EKOC3astro3_S38_L004_R1_001.fastq.gz
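
Each `summary.txt` is a tab-separated file with three columns (status, module, file name), so the per-sample summaries can also be combined and tabulated in R. The following is a minimal sketch under that assumption; the helper `read_fastqc_summaries` and the column names are illustrative choices, not part of FastQC or of the original pipeline.

```r
# Hypothetical helper: read every summary.txt found below outdir into one
# data frame with the columns status, module and file.
read_fastqc_summaries <- function(outdir) {
    files <- list.files(outdir, pattern = "^summary\\.txt$",
                        recursive = TRUE, full.names = TRUE)
    if (length(files) == 0) {
        # nothing found: return an empty data frame with the same columns
        return(data.frame(status = character(0), module = character(0),
                          file = character(0), stringsAsFactors = FALSE))
    }
    tables <- lapply(files, read.delim, header = FALSE,
                     col.names = c("status", "module", "file"),
                     stringsAsFactors = FALSE)
    do.call(rbind, tables)
}

# Usage with the output directory defined earlier in this notebook:
# summaries <- read_fastqc_summaries(outpath)
# table(summaries$module, summaries$status)
```

The resulting cross-tabulation counts how many files were flagged PASS, WARN or FAIL for each module, which makes it easier to spot modules that fail across the whole run.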
