# FASTQ QC

## Aims of this notebook
- Trim the FASTQ reads by base quality, remove potential Illumina adapters that may cause problems in mapping, and filter out short reads
- Update the metadata table with the new file names
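
To make the first aim concrete, here is a minimal sketch of 3' quality trimming followed by a length filter. Trim Galore delegates the actual trimming to Cutadapt; the running-sum rule below follows the way the Cutadapt documentation describes its quality trimming, but it is a simplified illustration, not the tool's implementation. The function names, the toy read, and the cutoff values are all hypothetical.

```python
def quality_trim_index(qualities, cutoff):
    """Return how many bases to keep after trimming low-quality bases from the 3' end.

    Subtract the cutoff from every quality score, then sum from each position
    to the end of the read; the read is cut where that sum is smallest.
    """
    best_sum = 0
    running = 0
    keep = len(qualities)
    # walk from the 3' end toward the 5' end
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            keep = i
    return keep


def trim_read(sequence, qualities, cutoff, min_length):
    """Quality-trim one read; return None if the result is shorter than min_length."""
    keep = quality_trim_index(qualities, cutoff)
    if keep < min_length:
        return None
    return sequence[:keep], qualities[:keep]


# toy read (hypothetical): the last six bases have low quality
toy_quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
print(trim_read("ACGTACGTAC", toy_quals, cutoff=10, min_length=3))
```

With a stricter `min_length` (for example 5) the same read would be discarded instead of kept, which is the "filter out short reads" step.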

## Installation

### Install cutadapt, fastqc
`conda install -n genome_book -c bioconda cutadapt fastqc`

### Install trim_galore
See the [Trim Galore GitHub repository](https://github.com/FelixKrueger/TrimGalore) for instructions.
```{shell}
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.6.5.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz
# Run Trim Galore
./TrimGalore-0.6.5/trim_galore
```
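
Before running anything, it can help to confirm that the tools installed above are visible from the notebook. The sketch below uses `shutil.which` from the standard library; the helper name `find_missing_tools` is hypothetical, and the check assumes the notebook kernel runs inside the `genome_book` environment.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the tools that cannot be found on the current PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


missing = find_missing_tools(['cutadapt', 'fastqc'])
if missing:
    print('Not found on PATH:', ', '.join(missing))
else:
    print('cutadapt and fastqc are both available.')
```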

In [1]:
import pandas as pd
import pathlib
import subprocess

In [2]:
fastq_meta = pd.read_csv('./data/fastq/fastq_metadata.csv', index_col=0)
fastq_meta

| File accession | count_type | experiment_id | bio_sample_id | tissue | replicate | dev_time | file_name |
|---|---|---|---|---|---|---|---|
| ENCFF329ACL | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 1 | E11.5 | forebrain_E11.5_1_ENCFF329ACL.fastq.gz |
| ENCFF251LNG | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 2 | E11.5 | forebrain_E11.5_2_ENCFF251LNG.fastq.gz |
| ENCFF896COV | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 2 | E11.5 | forebrain_E11.5_2_ENCFF896COV.fastq.gz |
| ENCFF959PSX | reads | ENCSR970EWM | UBERON:0001890 | forebrain | 2 | E13.5 | forebrain_E13.5_2_ENCFF959PSX.fastq.gz |
| ENCFF235DNM | reads | ENCSR970EWM | UBERON:0001890 | forebrain | 1 | E13.5 | forebrain_E13.5_1_ENCFF235DNM.fastq.gz |
| ENCFF270GKY | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 1 | E14.5 | forebrain_E14.5_1_ENCFF270GKY.fastq.gz |
| ENCFF460TCF | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 1 | E14.5 | forebrain_E14.5_1_ENCFF460TCF.fastq.gz |
| ENCFF126IRS | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 2 | E14.5 | forebrain_E14.5_2_ENCFF126IRS.fastq.gz |
| ENCFF748SRJ | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 2 | E14.5 | forebrain_E14.5_2_ENCFF748SRJ.fastq.gz |
| ENCFF447EXU | reads | ENCSR362AIZ | UBERON:0001890 | forebrain | 2 | P0 | forebrain_P0_2_ENCFF447EXU.fastq.gz |


In [3]:
output_dir = pathlib.Path('./data/fastq/trimmed/').absolute()
# create the output directory (and any missing parents) if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)

fastq_dir = pathlib.Path('./data/fastq/').absolute()
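
Since every later step depends on the files listed in the metadata table, a quick existence check can catch a wrong path early. This is a minimal sketch; `find_missing_inputs` is a hypothetical helper, and the demo uses a temporary directory with made-up file names rather than the real data.

```python
import pathlib
import tempfile


def find_missing_inputs(file_names, directory):
    """Return the file names that do not exist as files inside `directory`."""
    directory = pathlib.Path(directory)
    return [name for name in file_names if not (directory / name).is_file()]


# demo with a temporary directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / 'sample_A.fastq.gz').touch()
    demo_missing = find_missing_inputs(['sample_A.fastq.gz', 'sample_B.fastq.gz'], tmp)

print(demo_missing)
```

In this notebook the call would look like `find_missing_inputs(fastq_meta['file_name'], fastq_dir)`; an empty list means every input file is in place.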

## Run commands sequentially

In [4]:
# change this to your path to trim_galore installation
trim_galore_path = '/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore'

# iterate through each file in the fastq_meta
for file_name in fastq_meta['file_name']:
    file_path = fastq_dir / file_name
    # make the command using f-string
    command = f'{trim_galore_path} {file_path} --fastqc -o {output_dir}'
    # execute the command using subprocess
    subprocess.run(command, shell=True, check=True)
    
    print(command)
    print(file_name, 'trim finished.')

/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/forebrain_E11.5_1_ENCFF329ACL.fastq.gz --fastqc -o /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed
forebrain_E11.5_1_ENCFF329ACL.fastq.gz trim finished.
/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/forebrain_E11.5_2_ENCFF251LNG.fastq.gz --fastqc -o /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed
forebrain_E11.5_2_ENCFF251LNG.fastq.gz trim finished.
/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/forebrain_E11.5_2_ENCFF896COV.fastq.gz --fastqc -o /Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed
forebrain_E11.5_2_ENCFF896COV.fastq.gz trim finished.
/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore
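
The loop above builds the command as a single string and runs it through the shell. One alternative, sketched below, is to pass the arguments as a list so that paths containing spaces need no quoting. The helper name `build_trim_command` and the example paths are hypothetical; the `--fastqc` and `-o` options are the ones already used in this notebook.

```python
import shlex


def build_trim_command(trim_galore_path, file_path, output_dir):
    """Build the Trim Galore call as an argument list (no shell needed)."""
    return [str(trim_galore_path), str(file_path), '--fastqc', '-o', str(output_dir)]


# hypothetical paths, only to show the shape of the command
command = build_trim_command('/path/to/trim_galore',
                             '/data/my sample.fastq.gz',
                             '/data/trimmed')
# shlex.join (Python 3.8+) gives a copy-pasteable string for logging
print(shlex.join(command))
# to execute it: subprocess.run(command, check=True)
```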

## [OPTIONAL] Run commands in parallel
- When you have many samples, running commands in parallel lets you use all the cores of your laptop or server.

In [5]:
# number of jobs to run at the same time; choose it based on how many cores your computer has
process = 4

# this is one way to run things in parallel in Python
from concurrent.futures import ProcessPoolExecutor, as_completed
import subprocess

In [6]:
trim_galore_path = '/Users/hq/Documents/pkg/TrimGalore-0.6.5/trim_galore'
with ProcessPoolExecutor(process) as executor:
    # executor is a process pool; it controls how many processes run in parallel
    
    futures = {}
    for file_name in fastq_meta['file_name']:
        file_path = fastq_dir / file_name
        command = f'{trim_galore_path} {file_path} --fastqc -o {output_dir}'
        
        # executor.submit schedules a job (a function plus its arguments; here the function is subprocess.run)
        # and returns a future object that refers to this job, so we store it in futures first
        future = executor.submit(subprocess.run, command, shell=True, check=True)
        futures[future] = file_name
    
    # as_completed yields each future as soon as its job has finished
    for future in as_completed(futures):
        # get back the file name associated with this future
        file_name = futures[future]
        
        # this line is important: it raises an exception if the job failed
        # it also returns the value of subprocess.run, which we do not need here
        _ = future.result()
        
        print(file_name, 'trim finished.')

forebrain_E13.5_2_ENCFF959PSX.fastq.gz trim finished.
forebrain_E11.5_1_ENCFF329ACL.fastq.gz trim finished.
forebrain_E11.5_2_ENCFF251LNG.fastq.gz trim finished.
forebrain_E11.5_2_ENCFF896COV.fastq.gz trim finished.
forebrain_E13.5_1_ENCFF235DNM.fastq.gz trim finished.
forebrain_E14.5_1_ENCFF460TCF.fastq.gz trim finished.
forebrain_E14.5_1_ENCFF270GKY.fastq.gz trim finished.
forebrain_E14.5_2_ENCFF126IRS.fastq.gz trim finished.
forebrain_E14.5_2_ENCFF748SRJ.fastq.gz trim finished.
forebrain_P0_1_ENCFF037JQC.fastq.gz trim finished.
forebrain_P0_2_ENCFF447EXU.fastq.gz trim finished.
forebrain_P0_2_ENCFF458NWF.fastq.gz trim finished.
forebrain_E15.5_2_ENCFF891HIX.fastq.gz trim finished.
forebrain_E12.5_2_ENCFF203BWA.fastq.gz trim finished.
forebrain_P0_1_ENCFF358MFI.fastq.gz trim finished.
forebrain_E15.5_1_ENCFF179JEC.fastq.gz trim finished.
forebrain_E10.5_1_ENCFF920CNZ.fastq.gz trim finished.
forebrain_E12.5_1_ENCFF294JRP.fastq.gz trim finished.
forebrain_E12.5_1_ENCFF920QAY.fastq.gz t
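
Because `subprocess.run` already starts a separate operating-system process for each command, a thread pool is also enough to keep several jobs running at once. The sketch below shows that variant of the same submit / `as_completed` pattern. The function name `run_commands` is hypothetical, and the demo commands just call the Python interpreter so the example runs without Trim Galore installed.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_commands(commands, max_workers=4):
    """Run labelled commands in parallel and return {label: return code}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map each future back to its label so we know which job finished
        futures = {
            executor.submit(subprocess.run, cmd, check=True, capture_output=True): label
            for label, cmd in commands.items()
        }
        for future in as_completed(futures):
            label = futures[future]
            # result() re-raises any exception from the job (e.g. a non-zero exit)
            results[label] = future.result().returncode
    return results


# stand-in commands: each one only prints a number
demo = {f'job_{i}': [sys.executable, '-c', f'print({i})'] for i in range(3)}
results = run_commands(demo, max_workers=2)
print(results)
```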

## Make metadata for trimmed fastq

In [9]:
# find all the trimmed FASTQ files and collect them in a list
fastq_list = list(output_dir.glob('*trimmed.fq.gz'))
fastq_list[:5]

[PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_E12.5_1_ENCFF920QAY_trimmed.fq.gz'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_E12.5_1_ENCFF294JRP_trimmed.fq.gz'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_P0_2_ENCFF447EXU_trimmed.fq.gz'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_E11.5_2_ENCFF896COV_trimmed.fq.gz'),
 PosixPath('/Users/hq/Documents/pkg/py_genome_sci_book/analysis/salmon_demo/data/fastq/trimmed/forebrain_E15.5_1_ENCFF179JEC_trimmed.fq.gz')]

In [10]:
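# replace the original file names with the paths of the trimmed files
# the accession is the 4th underscore-separated field of each file name;
# assigning the Series aligns it to fastq_meta by accession (the index)
fastq_series = pd.Series({i.name.split('_')[3]: str(i) for i in fastq_list})
fastq_meta['file_name'] = fastq_series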
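
The mapping above relies on the accession always being the fourth underscore-separated field. A slightly more defensive variant, sketched here, strips the `_trimmed.fq.gz` suffix and takes the last field instead, then reports accessions that have no trimmed file. The helper names and the shortened example paths are hypothetical.

```python
import pathlib

TRIMMED_SUFFIX = '_trimmed.fq.gz'


def accession_from_trimmed_name(path):
    """Extract the accession (last '_' field before the trimmed suffix)."""
    name = pathlib.Path(path).name
    if not name.endswith(TRIMMED_SUFFIX):
        raise ValueError(f'not a trimmed FASTQ file name: {name}')
    return name[:-len(TRIMMED_SUFFIX)].rsplit('_', 1)[-1]


def map_accessions(trimmed_paths):
    """Return {accession: path string} for a list of trimmed FASTQ paths."""
    return {accession_from_trimmed_name(p): str(p) for p in trimmed_paths}


# shortened, hypothetical paths for illustration
paths = ['trimmed/forebrain_E12.5_1_ENCFF920QAY_trimmed.fq.gz',
         'trimmed/forebrain_P0_2_ENCFF447EXU_trimmed.fq.gz']
mapping = map_accessions(paths)

# accessions expected from the metadata table but lacking a trimmed file
expected = ['ENCFF920QAY', 'ENCFF447EXU', 'ENCFF329ACL']
missing = [acc for acc in expected if acc not in mapping]
print(missing)
```

When a pandas Series is assigned as in the cell above, accessions without a matching trimmed file end up as missing values in `file_name`, so a check like this can flag them before the table is saved.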

In [11]:
fastq_meta

| File accession | count_type | experiment_id | bio_sample_id | tissue | replicate | dev_time | file_name |
|---|---|---|---|---|---|---|---|
| ENCFF329ACL | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 1 | E11.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF251LNG | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 2 | E11.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF896COV | reads | ENCSR160IIN | UBERON:0001890 | forebrain | 2 | E11.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF959PSX | reads | ENCSR970EWM | UBERON:0001890 | forebrain | 2 | E13.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF235DNM | reads | ENCSR970EWM | UBERON:0001890 | forebrain | 1 | E13.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF270GKY | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 1 | E14.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF460TCF | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 1 | E14.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF126IRS | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 2 | E14.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF748SRJ | reads | ENCSR185LWM | UBERON:0001890 | forebrain | 2 | E14.5 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |
| ENCFF447EXU | reads | ENCSR362AIZ | UBERON:0001890 | forebrain | 2 | P0 | /Users/hq/Documents/pkg/py_genome_sci_book/ana... |


In [12]:
fastq_meta.to_csv('data/fastq/trimmed/trimmed_fastq_metadata.csv')

In [14]:
!ls -hl data/fastq/trimmed/

total 90016
-rw-r--r--  1 hq  staff   2.8K May 20 15:28 forebrain_E10.5_1_ENCFF320FJX.fastq.gz_trimming_report.txt
-rw-r--r--  1 hq  staff   669K May 20 15:28 forebrain_E10.5_1_ENCFF320FJX_trimmed.fq.gz
-rw-r--r--@ 1 hq  staff   633K May 20 15:28 forebrain_E10.5_1_ENCFF320FJX_trimmed_fastqc.html
-rw-r--r--  1 hq  staff   352K May 20 15:28 forebrain_E10.5_1_ENCFF320FJX_trimmed_fastqc.zip
-rw-r--r--  1 hq  staff   2.8K May 20 15:28 forebrain_E10.5_1_ENCFF920CNZ.fastq.gz_trimming_report.txt
-rw-r--r--  1 hq  staff   672K May 20 15:28 forebrain_E10.5_1_ENCFF920CNZ_trimmed.fq.gz
-rw-r--r--@ 1 hq  staff   640K May 20 15:28 forebrain_E10.5_1_ENCFF920CNZ_trimmed_fastqc.html
-rw-r--r--  1 hq  staff   360K May 20 15:28 forebrain_E10.5_1_ENCFF920CNZ_trimmed_fastqc.zip
-rw-r--r--  1 hq  staff   2.9K May 20 15:28 forebrain_E10.5_2_ENCFF528EVC.fastq.gz_trimming_report.txt
-rw-r--r--  1 hq  staff   667K May 20 15:28 forebrain_E10.5_2_ENCFF528EVC_trimmed.fq.gz
-rw-r--r--@ 1 hq  staff   633K

## Output of this notebook

**In the ./data/fastq/trimmed dir**

1. We have 25 trimmed FASTQ files
2. We have an updated metadata table recording the sample and database ID information for these 25 files.
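
The file counts above can be checked programmatically. The sketch below counts files by name suffix; `count_outputs` is a hypothetical helper, the suffixes are the ones visible in the directory listing, and the demo runs on a temporary directory with made-up sample names rather than the real output.

```python
import pathlib
import tempfile


def count_outputs(directory, suffixes):
    """Count the files in `directory` whose names end with each given suffix."""
    counts = {suffix: 0 for suffix in suffixes}
    for path in pathlib.Path(directory).iterdir():
        for suffix in suffixes:
            if path.name.endswith(suffix):
                counts[suffix] += 1
    return counts


suffixes = ['_trimmed.fq.gz', '_trimming_report.txt', '_trimmed_fastqc.html']

# demo on a temporary directory with two made-up samples
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    for sample in ['sample_A', 'sample_B']:
        (tmp / f'{sample}_trimmed.fq.gz').touch()
        (tmp / f'{sample}.fastq.gz_trimming_report.txt').touch()
    demo_counts = count_outputs(tmp, suffixes)

print(demo_counts)
```

Run against `./data/fastq/trimmed`, the count for `_trimmed.fq.gz` should match the number of rows in the updated metadata table.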