# Preprocessing part


The main goal of this notebook is to preprocess the input FASTQ files.
### Input: 
   * directory where the fastq files are stored
   
### Outputs:
  * quality control reports for each input file, generated with fastqc
  * a summary of the quality control reports, generated with multiqc

### Requirements:
  * fastqc, multiqc, trimmomatic
  * turbine_lib.py (it contains some support functions)
  * input data and project directories
  * the inputs are assumed to be single-end reads
  * optimized preprocessing parameters (assessed after the QC step)
  


### Assumptions and notes
  * the proper paths and project data should be set before the run
  * the preprocessing parameters for trimmomatic should be chosen based on the QC results of the raw data
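
Before running the notebook, it can help to confirm that the required command-line tools are reachable. The following is a minimal sketch that assumes the tools are installed under the names `fastqc`, `multiqc` and `trimmomatic` and are on the `PATH`; the helper `find_missing_tools` is hypothetical and is not part of turbine_lib.py.

```python
import shutil

# Tool names as they would be called on the command line (assumed to be on PATH).
REQUIRED_TOOLS = ["fastqc", "multiqc", "trimmomatic"]


def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If a tool is installed under a different name or outside the `PATH`, adjust the list (or `trimmomatic_path` in the parameters cell) accordingly.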


In [28]:
# Setting up the environment
import math
import multiprocessing
import os
import random
import sys
from multiprocessing import Pool
from os import listdir
from os.path import join, isfile, splitext

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns; sns.set()
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import HTML, display  # used by the report-link cells
from scipy.stats.mstats import gmean

from turbine_lib import *

# Show the value of every top-level expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

# Reload edited modules (such as turbine_lib) automatically
%load_ext autoreload
%autoreload 2

# Input
base_path = '/home/ligeti/gitrepos/turbine-rnaseq-ligeti'
data_source_dir = join(base_path, 'data', 'raw_data')


# Inputs, parameters
trimmomatic_path = 'trimmomatic'
trimmomatic_adapter_bpath = join(base_path,'data', 'adapters')
trimmomatic_adapter_path = join(trimmomatic_adapter_bpath, 'TruSeq2-SE.fa')
trimmomatic_parameters = 'ILLUMINACLIP:{0}:2:30:10:2 SLIDINGWINDOW:5:25 MINLEN:30 HEADCROP:12'.format(trimmomatic_adapter_path)


# Outputs
results_path = join(base_path, 'results')
quality_control_output = join(results_path, 'raw_qc/')
quality_control_summary_output = join(results_path, 'raw_multiqc/')
trimmed_quality_control_output = join(results_path, 'trimmed_qc/')
trimmed_quality_control_summary_output = join(results_path, 'trimmed_multiqc/')

# trimming
trimmed_path = join(results_path, 'trimmed_files/')

# Other parameters
# Core counts
number_of_threads = multiprocessing.cpu_count()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# Creating the project directory structure
for directory in (results_path,
                  quality_control_output,
                  quality_control_summary_output,
                  trimmed_path,
                  trimmed_quality_control_output,
                  trimmed_quality_control_summary_output):
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(directory, exist_ok=True)


# Running QC with fastqc and multiQC


In [3]:
run_qc(data_source_dir, quality_control_output, quality_control_summary_output, default_fastq_ext='fastq', threads=number_of_threads)


Running raw fastqc: fastqc --outdir /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/      -t 12      /home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/raw_data/*fastq     


Started analysis of sum159_dmso1.fastq
Started analysis of sum159_dmso2.fastq
Started analysis of sum159_jq1.fastq
Started analysis of sum159_jq2.fastq
Approx 5% complete for sum159_dmso1.fastq
Approx 5% complete for sum159_dmso2.fastq
Approx 5% complete for sum159_jq1.fastq
Approx 5% complete for sum159_jq2.fastq
Approx 10% complete for sum159_dmso1.fastq
Approx 10% complete for sum159_dmso2.fastq
Approx 10% complete for sum159_jq1.fastq
Approx 10% complete for sum159_jq2.fastq
Approx 15% complete for sum159_dmso1.fastq
Approx 15% complete for sum159_dmso2.fastq
Approx 15% complete for sum159_jq1.fastq
Approx 15% complete for sum159_jq2.fastq
Approx 20% complete for sum159_dmso1.fastq
Approx 20% complete for sum159_dmso2.fastq
Approx 20% complete for sum159_jq1.fastq
Approx 20% complete for sum159_jq2.fastq
Approx 25% complete for sum159_dmso1.fastq
Approx 25% complete for sum159_dmso2.fastq
Approx 25% complete for sum159_jq1.fastq
Approx 25% complete for sum159_jq2.fastq
Approx 30% c

Analysis complete for sum159_dmso1.fastq


Approx 100% complete for sum159_jq1.fastq
Approx 100% complete for sum159_dmso2.fastq


Analysis complete for sum159_jq1.fastq
Analysis complete for sum159_dmso2.fastq


Approx 100% complete for sum159_jq2.fastq


Analysis complete for sum159_jq2.fastq
Running raw multiqc: multiqc -f --interactive  --outdir /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_multiqc/     /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/*     



  /// MultiQC 🔍 | v1.12

|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_dmso1_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_dmso1_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_dmso2_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_dmso2_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_jq1_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_jq1_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_jq2_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/raw_qc/sum159_jq2_fastqc.zip


|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 8/8  


|            fastqc | Found 4 reports
|           multiqc | Compressing plot data
|           multiqc | Deleting    : ../results/raw_multiqc/multiqc_report.html   (-f was specified)
|           multiqc | Deleting    : ../results/raw_multiqc/multiqc_data   (-f was specified)
|           multiqc | Report      : ../results/raw_multiqc/multiqc_report.html
|           multiqc | Data        : ../results/raw_multiqc/multiqc_data
|           multiqc | MultiQC complete
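
The log above prints the two command lines that were executed. The implementation of `run_qc` lives in turbine_lib.py and is not shown here, so the following is only a sketch of how such command strings could be assembled. The helper name `build_qc_commands` and the example paths are hypothetical; the flags (`--outdir`, `-t`, `-f`, `--interactive`) are the ones visible in the log.

```python
from os.path import join


def build_qc_commands(fastq_dir, qc_dir, summary_dir, ext="fastq", threads=1):
    """Return the fastqc and multiqc command lines as strings (hypothetical helper)."""
    # fastqc writes one HTML report and one zip archive per input file into qc_dir
    fastqc_cmd = "fastqc --outdir {0} -t {1} {2}".format(
        qc_dir, threads, join(fastq_dir, "*" + ext))
    # multiqc collects the fastqc reports; -f overwrites an existing summary
    multiqc_cmd = "multiqc -f --interactive --outdir {0} {1}".format(
        summary_dir, join(qc_dir, "*"))
    return fastqc_cmd, multiqc_cmd


fastqc_cmd, multiqc_cmd = build_qc_commands(
    "data/raw_data", "results/raw_qc", "results/raw_multiqc", threads=4)
print(fastqc_cmd)
print(multiqc_cmd)
```

Because the input pattern contains a shell wildcard, such a command has to be run through a shell (for example with `os.system`) for the `*` to be expanded.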


In [20]:
# Generating links to the results (http, as in the link cell for the trimmed data)
multi_qc_results_link = '<h1>QC summary:</h1> <a href="http://localhost:8888/files/{0}/multiqc_report.html"> {1} </a> \
 <p> (For details visit the documentation of multiQC <a href="https://multiqc.info"> https://multiqc.info </a>) </p> <br><br>\
 '.format(quality_control_summary_output,'MultiQC results')
display(HTML(multi_qc_results_link))


# Preprocessing with trimmomatic
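Trimmomatic is used to preprocess the reads. The parameters are defined in the input parameters cell.
Trimmomatic provides various options for cleaning raw sequencing reads: it detects and removes adapter sequences, and it can crop the beginning and the end of the reads. Here, a read is cut where the average quality score within a sliding window of 5 bases drops below 25 (SLIDINGWINDOW:5:25), reads shorter than 30 bases are discarded (MINLEN:30), and the first 12 bases of each read are removed (HEADCROP:12). Trimmomatic applies the steps in the order in which they are given, so HEADCROP runs after the length filter and the final reads can be shorter than 30 bases.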
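
To make the sliding-window rule concrete, here is a simplified illustration in plain Python. It is not Trimmomatic's implementation: the function `sliding_window_keep` is hypothetical, it works on a list of Phred scores, and it cuts at the start of the first failing window, whereas the exact cut position chosen by Trimmomatic may differ.

```python
def sliding_window_keep(qualities, window=5, min_avg=25):
    """Return how many leading bases survive a simplified sliding-window trim."""
    n = len(qualities)
    if n < window:
        # Too short for a full window: judge the read as a whole.
        return n if n and sum(qualities) / n >= min_avg else 0
    for start in range(n - window + 1):
        # Cut at the first window whose mean quality falls below the threshold.
        if sum(qualities[start:start + window]) / window < min_avg:
            return start
    return n


# Ten good bases followed by five poor ones.
read_qualities = [30] * 10 + [10] * 5
kept = sliding_window_keep(read_qualities)
print(kept)
```

In this toy read, the window starting at position 6 still averages 26, while the one starting at position 7 averages 22, so the first 7 bases are kept. With `MINLEN:30`, a read this short would then be discarded.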


In [16]:
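# Build one trimmomatic command per sample and run the commands in parallel
prefix_to_pair = get_illumina_pairs(data_source_dir)
trim_cmds = [get_trimmomatic_cmd_default(trimmed_path, fastq_files, trimmomatic_path, trimmomatic_parameters)
             for _sample, fastq_files in prefix_to_pair.items()]
with Pool(number_of_threads) as p:
    # os.system returns the exit status of each command; 0 means success
    p.map(os.system, trim_cmds)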


TrimmomaticSE: Started with arguments:
 /home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/raw_data/sum159_jq1.fastq /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_files/sum159_jq1.trimmed.fastq ILLUMINACLIP:/home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/adapters/TruSeq2-SE.fa:2:30:10:2 SLIDINGWINDOW:5:25 MINLEN:30 HEADCROP:12
Automatically using 1 threads
TrimmomaticSE: Started with arguments:
TrimmomaticSE: Started with arguments: /home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/raw_data/sum159_jq2.fastqTrimmomaticSE: Started with arguments:
 /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_files/sum159_jq2.trimmed.fastq ILLUMINACLIP:/home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/adapters/TruSeq2-SE.fa:2:30:10:2
 SLIDINGWINDOW:5:25 /home/ligeti/gitrepos/turbine-rnaseq-ligeti/data/raw_data/sum159_dmso2.fastq MINLEN:30 HEADCROP:12 /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_files/sum159_dmso2.trimmed.fastq
 ILLUMINACLIP:/home/ligeti/gitrepos/tu

[0, 0, 0, 0]
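
The list `[0, 0, 0, 0]` holds the exit statuses returned by `os.system`, one per sample; zero means the command finished successfully. The log lines are interleaved because the four processes write to the same output at once. The command builder `get_trimmomatic_cmd_default` is defined in turbine_lib.py and is not shown, so the sketch below only illustrates what a single-end command consistent with the log could look like. The helper `build_trimmomatic_se_cmd`, the `SE` mode keyword of the `trimmomatic` wrapper script, and the example paths are assumptions.

```python
from os.path import basename, join, splitext


def build_trimmomatic_se_cmd(fastq_file, out_dir, steps, trimmomatic="trimmomatic"):
    """Return a single-end trimmomatic command line (hypothetical helper)."""
    # sample.fastq -> <out_dir>/sample.trimmed.fastq, as in the log above
    stem, ext = splitext(basename(fastq_file))
    trimmed_file = join(out_dir, stem + ".trimmed" + ext)
    return "{0} SE {1} {2} {3}".format(trimmomatic, fastq_file, trimmed_file, steps)


cmd = build_trimmomatic_se_cmd(
    "data/raw_data/sum159_jq1.fastq",
    "results/trimmed_files",
    "ILLUMINACLIP:adapters/TruSeq2-SE.fa:2:30:10:2 SLIDINGWINDOW:5:25 MINLEN:30 HEADCROP:12",
)
print(cmd)
```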

In [63]:
# running QC on filtered data
run_qc(trimmed_path, trimmed_quality_control_output, trimmed_quality_control_summary_output, default_fastq_ext='fastq')
trimmed_qc_table = get_qc_html_table(trimmed_quality_control_output)


Running raw fastqc: fastqc --outdir /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/      -t 20      /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_files/*fastq     


Started analysis of sum159_dmso1.trimmed.fastq
Started analysis of sum159_dmso2.trimmed.fastq
Started analysis of sum159_jq1.trimmed.fastq
Started analysis of sum159_jq2.trimmed.fastq
Approx 5% complete for sum159_dmso1.trimmed.fastq
Approx 5% complete for sum159_dmso2.trimmed.fastq
Approx 5% complete for sum159_jq1.trimmed.fastq
Approx 10% complete for sum159_dmso1.trimmed.fastq
Approx 5% complete for sum159_jq2.trimmed.fastq
Approx 10% complete for sum159_dmso2.trimmed.fastq
Approx 10% complete for sum159_jq1.trimmed.fastq
Approx 15% complete for sum159_dmso1.trimmed.fastq
Approx 10% complete for sum159_jq2.trimmed.fastq
Approx 15% complete for sum159_dmso2.trimmed.fastq
Approx 15% complete for sum159_jq1.trimmed.fastq
Approx 20% complete for sum159_dmso1.trimmed.fastq
Approx 15% complete for sum159_jq2.trimmed.fastq
Approx 20% complete for sum159_dmso2.trimmed.fastq
Approx 20% complete for sum159_jq1.trimmed.fastq
Approx 20% complete for sum159_jq2.trimmed.fastq
Approx 25% complete 

Analysis complete for sum159_dmso1.trimmed.fastq
Analysis complete for sum159_jq2.trimmed.fastq
Analysis complete for sum159_jq1.trimmed.fastq
Analysis complete for sum159_dmso2.trimmed.fastq
Running raw multiqc: multiqc -f --interactive  --outdir /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_multiqc/     /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/*     



  /// MultiQC 🔍 | v1.12

|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_dmso1.trimmed_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_dmso1.trimmed_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_dmso2.trimmed_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_dmso2.trimmed_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_jq1.trimmed_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_jq1.trimmed_fastqc.zip
|           multiqc | Search path : /home/ligeti/gitrepos/turbine-rnaseq-ligeti/results/trimmed_qc/sum159_jq2.trimmed_fastqc.html
|           multiqc | Search path : /home/ligeti/gitrepos/t

|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 8/8  


|            fastqc | Found 4 reports
|           multiqc | Compressing plot data
|           multiqc | Deleting    : ../results/trimmed_multiqc/multiqc_report.html   (-f was specified)
|           multiqc | Deleting    : ../results/trimmed_multiqc/multiqc_data   (-f was specified)
|           multiqc | Report      : ../results/trimmed_multiqc/multiqc_report.html
|           multiqc | Data        : ../results/trimmed_multiqc/multiqc_data
|           multiqc | MultiQC complete


In [22]:
multi_qc_results_link = '<h1>QC summary:</h1> <a href="http://{2}/files/{0}/multiqc_report.html"> {1} </a> \
 <p> (For details visit the documentation of multiQC <a href="https://multiqc.info"> https://multiqc.info </a>) </p> <br><br>\
 '.format('gitrepos/turbine-rnaseq-ligeti/results/trimmed_multiqc/','MultiQC results', 'localhost:8888')
display(HTML(multi_qc_results_link))
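

The report links use the `/files/` route of the notebook server, which expects a path relative to the directory the server was started in. Instead of hard-coding that relative path, it could be derived from the output directory. This is a minimal sketch: the helper `report_url`, the example paths, and the assumption that the server root is known are all hypothetical.

```python
import os
from urllib.parse import quote


def report_url(report_dir, server_root, host="localhost:8888"):
    """Build the URL of multiqc_report.html as served by the notebook server."""
    # Path of the report directory relative to the server root
    rel_dir = os.path.relpath(report_dir, server_root)
    # URLs always use forward slashes, whatever the local path separator is
    rel_path = "/".join(rel_dir.split(os.sep) + ["multiqc_report.html"])
    return "http://{0}/files/{1}".format(host, quote(rel_path))


url = report_url("/home/user/project/results/trimmed_multiqc/", "/home/user")
print(url)
```

The resulting string can be placed in the `href` attribute of the link in place of the hard-coded path.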