# Description
Single sample workflow
<br>
<br>
**Steps**
- Data pre-processing
- Variant calling with **HaplotypeCaller**
- Variant filtering with **1D_CNN_model**
- Variant annotation with **VEP**

# Preparation

In [1]:
import os

### Make directory

In [2]:
out_dir = "/media/vinbdi/data/tienanh/gatk"
#os.mkdir(out_dir)
out_aln = os.path.join(out_dir,'aln')
#os.mkdir(out_aln)
out_qual = os.path.join(out_dir,'qualification')
#os.mkdir(out_qual)
tmp = os.path.join(out_dir,'tmp')
#os.mkdir(tmp)
out_vcf = os.path.join(out_dir,'vcf')
#os.mkdir(out_vcf)
out_fil = os.path.join(out_dir,'filter')
#os.mkdir(out_fil)
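
The commented-out `os.mkdir` calls raise an error if a directory already exists or if the parent is missing. A minimal sketch of an idempotent alternative follows, assuming `os.makedirs(..., exist_ok=True)` is acceptable here; `make_output_dirs` is a hypothetical helper, and the demonstration writes to a temporary directory rather than the real `out_dir`.

```python
import os
import tempfile

def make_output_dirs(base_dir, names=("aln", "qualification", "tmp", "vcf", "filter")):
    """Create base_dir plus one subdirectory per name and return {name: path}."""
    paths = {}
    for name in names:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat when the cell is re-run
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths

# Demonstration on a throwaway directory (an assumption, not the real out_dir)
demo_base = os.path.join(tempfile.mkdtemp(), "gatk")
dirs = make_output_dirs(demo_base)
print(dirs["aln"])
```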

### Input variables

In [3]:
ref_fasta = '/home/vinbdi/Desktop/ref38/hg38.fasta'
dbsnp = '/home/vinbdi/Desktop/ref38/resources-broad-hg38-v0-Homo_sapiens_assembly38.dbsnp138.vcf'
G1_R1 = 'NSAIDS-0011_S1_L001_R1_001.fastq.gz'
G1_R2 = 'NSAIDS-0011_S1_L001_R2_001.fastq.gz'
G2_R1 = 'NSAIDS-0011_S1_L002_R1_001.fastq.gz'
G2_R2 = 'NSAIDS-0011_S1_L002_R2_001.fastq.gz'
G3_R1 = 'NSAIDS-0011_S1_L003_R1_001.fastq.gz'
G3_R2 = 'NSAIDS-0011_S1_L003_R2_001.fastq.gz'
G4_R1 = 'NSAIDS-0011_S1_L004_R1_001.fastq.gz'
G4_R2 = 'NSAIDS-0011_S1_L004_R2_001.fastq.gz'

### Create dictionary containing variables

In [4]:
sample = {}
sample['name']= "NSAIDS_0011"
sample['groups']=[]
sample['groups'].append({'groupname':'L001','read1':G1_R1,'read2':G1_R2})
sample['groups'].append({'groupname':'L002','read1':G2_R1,'read2':G2_R2})
sample['groups'].append({'groupname':'L003','read1':G3_R1,'read2':G3_R2})
sample['groups'].append({'groupname':'L004','read1':G4_R1,'read2':G4_R2})
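
Since the four lanes differ only in the lane tag, the same dictionary could also be generated from a list of lanes. This is only a sketch: `build_sample` and its `sample_index` parameter are hypothetical, and it assumes the FASTQ files keep the Illumina-style naming shown above.

```python
def build_sample(name, fastq_prefix, lanes, sample_index="S1"):
    """Return a sample dict with one read group per lane."""
    groups = []
    for lane in lanes:
        stem = f"{fastq_prefix}_{sample_index}_{lane}"
        groups.append({
            "groupname": lane,
            "read1": f"{stem}_R1_001.fastq.gz",
            "read2": f"{stem}_R2_001.fastq.gz",
        })
    return {"name": name, "groups": groups}

sample_alt = build_sample("NSAIDS_0011", "NSAIDS-0011", ["L001", "L002", "L003", "L004"])
print(sample_alt["groups"][0])
```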

# A. Data pre-processing
### 1. Mapping reads to the reference genome
- Map reads to the reference genome with `bwa mem`
- Use `samtools view` to compress the *sam* output to *bam*
- Use `SortSam` to sort the bam file in coordinate order

In [5]:
for group in sample['groups']:
    #map reads to reference
    group['mappedbam']= os.path.join(out_aln,sample['name']+'_'+group['groupname']+'_mapped.bam')
    #the read group is single-quoted so that the shell passes "\t" to bwa unchanged
    cmd = rf"""bwa mem -M \
                -t 20 \
                -R '@RG\tID:{group['groupname']}\tSM:{sample['name']}\tLB:lib1\tPL:ILLUMINA' \
                {ref_fasta} \
                {group['read1']} \
                {group['read2']} \
                | samtools view -Shb -o {group['mappedbam']}"""
    print(cmd)
    #os.system(cmd)
            
    #sortbam
    group['sortedbam']= os.path.join(out_aln,sample['name']+'_'+group['groupname']+'_sorted.bam')
    cmd = rf"""gatk SortSam \
            -I {group['mappedbam']} \
            -O {group['sortedbam']} \
            -SORT_ORDER coordinate \
            --TMP_DIR {tmp}"""
    print(cmd)
    #os.system(cmd)

bwa mem -M \
                -t 20 \
                -R '@RG\tID:L001\tSM:NSAIDS_0011\tLB:lib1\tPL:ILLUMINA' \
                /home/vinbdi/Desktop/ref38/hg38.fasta \
                NSAIDS-0011_S1_L001_R1_001.fastq.gz \
                NSAIDS-0011_S1_L001_R2_001.fastq.gz \
                | samtools view -Shb -o /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L001_mapped.bam
gatk SortSam \
            -I /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L001_mapped.bam \
            -O /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L001_sorted.bam \
            -SORT_ORDER coordinate \
            --TMP_DIR /media/vinbdi/data/tienanh/gatk/tmp
bwa mem -M \
                -t 20 \
                -R '@RG\tID:L002\tSM:NSAIDS_0011\tLB:lib1\tPL:ILLUMINA' \
                /home/vinbdi/Desktop/ref38/hg38.fasta \
                NSAIDS-0011_S1_L002_R1_001.fastq.gz \
                NSAIDS-0011_S1_L002_R2_001.fastq.gz \
                | samtools view -Shb -o /media/vinbdi/data/tienanh/gatk/aln/NS
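
The cells in this notebook print each command and leave `os.system(cmd)` commented out. One possible way to keep that dry-run behaviour while also catching failures is sketched below. `run_cmd` is a hypothetical helper, and it assumes `bash` is available so that `pipefail` can report a failing `bwa mem` on the left side of the pipe.

```python
import subprocess

def run_cmd(cmd, dry_run=True):
    """Print cmd; when dry_run is False, run it through bash and raise on failure."""
    print(cmd)
    if dry_run:
        return None
    # pipefail: the pipeline fails if any stage fails, not only the last one
    return subprocess.run(["bash", "-c", "set -o pipefail; " + cmd], check=True)

# Dry run: the command is only printed
run_cmd("echo mapping step")
```

With `dry_run=False`, a non-zero exit status raises `subprocess.CalledProcessError`, so a later step is not started on a truncated bam.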

### 2. Marking Duplicates
- Input: the 4 sorted bam files, one per read group
- Output: a single marked bam file for the whole sample

In [6]:
sample['markedbam']= os.path.join(out_qual,sample['name']+'_marked.bam')
sample['metrics']= os.path.join(out_qual,sample['name']+'_metrics.txt')

cmd = rf"""gatk MarkDuplicates \
            -I {sample['groups'][0]['sortedbam']} \
            -I {sample['groups'][1]['sortedbam']} \
            -I {sample['groups'][2]['sortedbam']} \
            -I {sample['groups'][3]['sortedbam']} \
            -O {sample['markedbam']} \
            -M {sample['metrics']} \
            --TMP_DIR {tmp}"""
print(cmd)
#os.system(cmd)

gatk MarkDuplicates \
            -I /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L001_sorted.bam \
            -I /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L002_sorted.bam \
            -I /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L003_sorted.bam \
            -I /media/vinbdi/data/tienanh/gatk/aln/NSAIDS_0011_L004_sorted.bam \
            -O /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_marked.bam \
            -M /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_metrics.txt \
            --TMP_DIR /media/vinbdi/data/tienanh/gatk/tmp
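
The cell above hard-codes four `-I` arguments by index. A sketch that builds them from however many read groups exist is shown here. `mark_duplicates_cmd` is a hypothetical helper and the relative paths in the demonstration are placeholders, not the real output locations.

```python
def mark_duplicates_cmd(sorted_bams, marked_bam, metrics, tmp_dir):
    """Build a MarkDuplicates command with one -I argument per sorted bam."""
    parts = ["gatk MarkDuplicates"]
    parts += [f"-I {bam}" for bam in sorted_bams]
    parts += [f"-O {marked_bam}", f"-M {metrics}", f"--TMP_DIR {tmp_dir}"]
    # Join with backslash-newline to match the layout of the printed commands
    return " \\\n            ".join(parts)

# Placeholder paths for illustration only
demo_groups = [{"sortedbam": f"aln/NSAIDS_0011_{lane}_sorted.bam"}
               for lane in ("L001", "L002", "L003", "L004")]
cmd = mark_duplicates_cmd([g["sortedbam"] for g in demo_groups],
                          "qualification/NSAIDS_0011_marked.bam",
                          "qualification/NSAIDS_0011_metrics.txt",
                          "tmp")
print(cmd)
```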


### 3. Recalibrating Base Quality Score
`--known-sites` currently uses only **dbSNP**

In [7]:
sample['recaltable']= os.path.join(out_qual,sample['name']+'_recal.table')
cmd = rf"""gatk BaseRecalibrator \
        -I {sample['markedbam']} \
        -R {ref_fasta} \
        --known-sites {dbsnp} \
        -O {sample['recaltable']}"""
print(cmd)
#os.system(cmd)

sample['arrbam']= os.path.join(out_qual,sample['name']+'_arr.bam')
cmd = rf"""gatk ApplyBQSR \
            -R {ref_fasta} \
            -I {sample['markedbam']} \
            --bqsr-recal-file {sample['recaltable']} \
            -O {sample['arrbam']}"""
print(cmd)
#os.system(cmd)

gatk BaseRecalibrator \
        -I /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_marked.bam \
        -R /home/vinbdi/Desktop/ref38/hg38.fasta \
        --known-sites /home/vinbdi/Desktop/ref38/resources-broad-hg38-v0-Homo_sapiens_assembly38.dbsnp138.vcf \
        -O /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_recal.table
gatk ApplyBQSR \
            -R /home/vinbdi/Desktop/ref38/hg38.fasta \
            -I /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_marked.bam \
            --bqsr-recal-file /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_recal.table \
            -O /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_arr.bam
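
`BaseRecalibrator` accepts `--known-sites` more than once, so further resources could be added later without changing the cell's structure. The sketch below assumes a hypothetical helper `base_recalibrator_cmd`; the second file name in the demonstration is a placeholder, not a file used in this run.

```python
def base_recalibrator_cmd(bam, ref, known_sites, out_table):
    """Build a BaseRecalibrator command with one --known-sites per resource."""
    if not known_sites:
        raise ValueError("BaseRecalibrator needs at least one known-sites file")
    parts = ["gatk BaseRecalibrator", f"-I {bam}", f"-R {ref}"]
    parts += [f"--known-sites {vcf}" for vcf in known_sites]
    parts.append(f"-O {out_table}")
    return " \\\n        ".join(parts)

# "known_indels.vcf.gz" is a placeholder name for an additional resource
cmd = base_recalibrator_cmd("NSAIDS_0011_marked.bam", "hg38.fasta",
                            ["dbsnp138.vcf", "known_indels.vcf.gz"],
                            "NSAIDS_0011_recal.table")
print(cmd)
```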


# B. Variant calling
`HaplotypeCaller` in single sample mode

In [8]:
sample['vcf']= os.path.join(out_vcf,sample['name']+'.vcf')
sample['bamout']= os.path.join(out_vcf,sample['name']+'_out.bam')
cmd = rf"""gatk HaplotypeCaller \
            -R {ref_fasta} \
            -I {sample['arrbam']} \
            -O {sample['vcf']} \
            -bamout {sample['bamout']}"""
print(cmd)
#os.system(cmd)

gatk HaplotypeCaller \
            -R /home/vinbdi/Desktop/ref38/hg38.fasta \
            -I /media/vinbdi/data/tienanh/gatk/qualification/NSAIDS_0011_arr.bam \
            -O /media/vinbdi/data/tienanh/gatk/vcf/NSAIDS_0011.vcf \
            -bamout /media/vinbdi/data/tienanh/gatk/vcf/NSAIDS_0011_out.bam
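
`HaplotypeCaller` needs the reference index (`.fai`), the sequence dictionary (`.dict`) and a bam index next to its inputs, and a missing one only shows up after the tool has started. A small pre-flight check could look like the sketch below. `required_inputs` and `missing_inputs` are hypothetical helpers, and accepting either `<name>.bam.bai` or `<name>.bai` for the bam index is an assumption about how the index was written.

```python
import os

def required_inputs(ref_fasta, bam):
    """Map each required input to the file names that would satisfy it."""
    ref_root = os.path.splitext(ref_fasta)[0]
    bam_root = os.path.splitext(bam)[0]
    return {
        "reference": [ref_fasta],
        "reference index": [ref_fasta + ".fai"],
        "sequence dictionary": [ref_root + ".dict"],
        "bam": [bam],
        # Assumption: either index naming convention is acceptable
        "bam index": [bam + ".bai", bam_root + ".bai"],
    }

def missing_inputs(ref_fasta, bam):
    """Return the labels of required inputs for which no candidate file exists."""
    return [label
            for label, candidates in required_inputs(ref_fasta, bam).items()
            if not any(os.path.exists(path) for path in candidates)]
```

Calling `missing_inputs(ref_fasta, sample['arrbam'])` before running the command would return an empty list when everything is in place.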


# C. Variant filtering

`CNNScoreVariants` fails out of the box with errors related to Python dependencies, so I tried creating a conda gatk environment
```bash
conda env create -f gatkcondaenv.yml
```
That still failed, so I switched to Docker
```bash
docker run -t -i -v /home/ted/ubuntu/ADR:/gatk/my_data broadinstitute/gatk:4.1.3.0
```
```bash
gatk CNNScoreVariants \
    -R /gatk/my_data/ref/Homo_sapiens_assembly38.fasta \
    -V /gatk/my_data/data/NSAIDS_0011.vcf \
    -O /gatk/my_data/data/NSAIDS_0011.1d_cnn_scored.vcf
```

`FilterVariantTranches` uses __1000G_omni2.5__ and __hapmap_3.3__ as resources
```bash
gatk FilterVariantTranches \
    -V /gatk/my_data/data/NSAIDS_0011.1d_cnn_scored.vcf \
    -O /gatk/my_data/data/NSAIDS_0011.1d_cnn_filtered.vcf \
    --resource /gatk/my_data/ref/1000G_omni2.5.hg38.vcf.gz \
    --resource /gatk/my_data/ref/hapmap_3.3.hg38.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.9 \
    --indel-tranche 95.0
```
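
To see how many records pass the tranche filter, the FILTER column of the filtered vcf can be tallied. This is a minimal sketch using only the standard library: `count_filters` is a hypothetical helper, and it makes no assumption about the names of the filter labels written by `FilterVariantTranches`.

```python
import gzip
from collections import Counter

def count_filters(vcf_path):
    """Count the values of the FILTER column (7th field) in a vcf or vcf.gz."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1
    return counts
```

For example, `count_filters("NSAIDS_0011.1d_cnn_filtered.vcf").most_common()` would list each FILTER value with its record count.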

# D. Variant annotation by VEP
The vcf file is too large to run through the web interface, so I ran the command-line tool <br>
Installation is fairly complicated because it needs many dependencies, so I used Docker <br>
Downloading the cache is very heavy and slow

```bash
docker run -it -v /home/ted/ubuntu/ADR:/opt/vep/.vep ensemblorg/ensembl-vep

perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38

./vep \
    -i /opt/vep/.vep/NSAIDS_0011.1d_cnn_filtered.vcf \
    -o /opt/vep/.vep/NSAIDS_0011.1d_cnn_filtered.txt \
    --offline
```
Results: [VEP summary](./data/NSAIDS_0011.1d_cnn_filtered.txt_summary.html)
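
Besides the HTML summary, the tab-delimited VEP output can be tallied directly. The sketch below assumes the default VEP text format, where `##` lines are metadata, a single `#`-prefixed line carries the column names, and a `Consequence` column may hold several comma-separated terms. `count_consequences` is a hypothetical helper.

```python
from collections import Counter

def count_consequences(vep_txt_path):
    """Count consequence terms in a tab-delimited VEP output file."""
    counts = Counter()
    header = None
    with open(vep_txt_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("##"):
                continue  # blank line or metadata
            if line.startswith("#"):
                header = line.lstrip("#").split("\t")  # column names
                continue
            if header is None:
                raise ValueError("column header line not found before data")
            row = dict(zip(header, line.split("\t")))
            for term in row["Consequence"].split(","):
                counts[term] += 1
    return counts
```

Because VEP writes one row per variant and transcript pair, these counts are per annotation row, not per variant.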