## Prepare Positive Samples

### Prepare FASTQ files

##### Download Datasets from SRA

In [None]:
!wget -O ./data/raw/SRR458758 https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR458758/SRR458758 # CLIP-seq (Ab: 35L33G)
!wget -O ./data/raw/SRR458759 https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR458759/SRR458759 # CLIP-seq (Ab: 2J3)
!wget -O ./data/raw/SRR458760 https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR458760/SRR458760 # CLIP-seq (Ab: polyclonal)
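
The three downloads share one URL pattern, so the commands can be generated from the accession list. This is a minimal sketch that only builds and prints the commands. The URL layout is copied from the `wget` calls above, and `sra_url` and `wget_command` are hypothetical helper names, not part of any library.

```python
# Accessions and antibody labels as listed in the wget commands above.
ACCESSIONS = {
    "SRR458758": "CLIP-seq (Ab: 35L33G)",
    "SRR458759": "CLIP-seq (Ab: 2J3)",
    "SRR458760": "CLIP-seq (Ab: polyclonal)",
}

# Assumption: every run follows the same S3 layout used above.
BASE_URL = "https://sra-pub-run-odp.s3.amazonaws.com/sra"


def sra_url(accession):
    """Build the download URL for one SRA run (hypothetical helper)."""
    return f"{BASE_URL}/{accession}/{accession}"


def wget_command(accession, out_dir="./data/raw"):
    """Return the wget argument list for one run (hypothetical helper)."""
    return ["wget", "-O", f"{out_dir}/{accession}", sra_url(accession)]


# Print the commands instead of running them, so they can be checked first.
for accession, label in ACCESSIONS.items():
    print(" ".join(wget_command(accession)), "#", label)
```

Passing each list to `subprocess.run(..., check=True)` would perform the actual download and stop at the first failure.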

##### Convert SRA files into FASTQ files

In [None]:
!fastq-dump -O ./data/fastq/ ./data/raw/SRR458758
!fastq-dump -O ./data/fastq/ ./data/raw/SRR458759
!fastq-dump -O ./data/fastq/ ./data/raw/SRR458760

##### Compress FASTQ files

In [None]:
!bgzip -@ 6 ./data/fastq/SRR458758.fastq
!bgzip -@ 6 ./data/fastq/SRR458759.fastq
!bgzip -@ 6 ./data/fastq/SRR458760.fastq

### Quality Control

##### Check with FastQC

In [None]:
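# FastQC does not create the output directory, so make sure it exists first
!mkdir -p ./data/fastq/fastqc
!fastqc ./data/fastq/SRR458758.fastq.gz -o ./data/fastq/fastqc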
!fastqc ./data/fastq/SRR458759.fastq.gz -o ./data/fastq/fastqc
!fastqc ./data/fastq/SRR458760.fastq.gz -o ./data/fastq/fastqc

##### Adapter Trimming

In [None]:
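# `!cd` runs in a throwaway subshell and does not persist; `%cd` changes the notebook's working directory
%cd ./data/fastq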

!bbduk.sh in=SRR458758.fastq.gz out=clean.SRR458758.fastq.gz \
    ref=/Users/jigsaw-0/Workspace/Bioinformatics/tools/bbmap/resources/adapters.fa \
        ktrim=r k=23 mink=11 hdist=1 tpe tbo

!bbduk.sh in=SRR458759.fastq.gz out=clean.SRR458759.fastq.gz \
    ref=/Users/jigsaw-0/Workspace/Bioinformatics/tools/bbmap/resources/adapters.fa \
        ktrim=r k=23 mink=11 hdist=1 tpe tbo

!bbduk.sh in=SRR458760.fastq.gz out=clean.SRR458760.fastq.gz \
    ref=/Users/jigsaw-0/Workspace/Bioinformatics/tools/bbmap/resources/adapters.fa \
        ktrim=r k=23 mink=11 hdist=1 tpe tbo

##### Base Quality Filtering

In [None]:
!bbduk.sh in=clean.SRR458758.fastq.gz out=clean2.SRR458758.fastq.gz qtrim=r trimq=10 maq=10
!bbduk.sh in=clean.SRR458759.fastq.gz out=clean2.SRR458759.fastq.gz qtrim=r trimq=10 maq=10
!bbduk.sh in=clean.SRR458760.fastq.gz out=clean2.SRR458760.fastq.gz qtrim=r trimq=10 maq=10

##### Filtered Fragment Length Statistics

In [None]:
# The sequencing is single-end, so the read-length statistics computed here
# are passed to kallisto downstream (--fragment-length and --sd).
# readlength.sh is part of BBMap.

!readlength.sh in=clean2.SRR458758.fastq.gz out=histogram1.txt bin=10 max=80000 # Avg : 53.1, Std : 17.9
!readlength.sh in=clean2.SRR458759.fastq.gz out=histogram2.txt bin=10 max=80000
!readlength.sh in=clean2.SRR458760.fastq.gz out=histogram3.txt bin=10 max=80000
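
The mean and standard deviation can also be computed directly in Python, which gives a way to cross-check the `readlength.sh` summary. This is a minimal sketch: `read_length_stats` is a hypothetical helper, and it uses the population standard deviation, which is an assumption and may differ slightly from how `readlength.sh` computes its value.

```python
import gzip
import math


def read_length_stats(fastq_path):
    """Return (mean, population std) of read lengths in a FASTQ or FASTQ.gz file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    n = 0          # number of reads
    total = 0      # sum of read lengths
    total_sq = 0   # sum of squared read lengths
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                length = len(line.strip())
                n += 1
                total += length
                total_sq += length * length
    if n == 0:
        raise ValueError(f"no reads found in {fastq_path}")
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(max(variance, 0.0))
```

Calling `read_length_stats("clean2.SRR458758.fastq.gz")` should give values close to the average and standard deviation noted in the comment above.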

##### Collapsing Duplicated Reads

In [None]:
'''
# collapsing reads of 20nt or longer to generate unique set of sequences
!zless clean2.SRR458758.fastq.gz | awk 'NR%4==2' | awk 'length>19' | sort | uniq -c > uniq.SRR458758.txt
!zless clean2.SRR458758.fastq.gz | awk 'NR%4==2' | awk 'length<20' > small.SRR458758.txt
!cat uniq.SRR458758.txt small.SRR458758.txt > final.SRR458758.txt 

!zless clean2.SRR458759.fastq.gz | awk 'NR%4==2' | awk 'length>19' | sort | uniq -c > uniq.SRR458759.txt
!zless clean2.SRR458759.fastq.gz | awk 'NR%4==2' | awk 'length<20' > small.SRR458759.txt
!cat uniq.SRR458759.txt small.SRR458759.txt > final.SRR458759.txt 

!zless clean2.SRR458760.fastq.gz | awk 'NR%4==2' | awk 'length>19' | sort | uniq -c > uniq.SRR458760.txt
!zless clean2.SRR458760.fastq.gz | awk 'NR%4==2' | awk 'length<20' > small.SRR458760.txt
!cat uniq.SRR458760.txt small.SRR458760.txt > final.SRR458760.txt
'''

In [None]:
!zless clean2.SRR458758.fastq.gz | awk 'NR%4==2' | sort | uniq -c > uniq.SRR458758.txt
!zless clean2.SRR458759.fastq.gz | awk 'NR%4==2' | sort | uniq -c > uniq.SRR458759.txt
!zless clean2.SRR458760.fastq.gz | awk 'NR%4==2' | sort | uniq -c > uniq.SRR458760.txt
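
The same collapsing step can be sketched in Python, which keeps the counts in memory for later filtering. `collapse_reads` and its `min_length` parameter are hypothetical names introduced for illustration. With `min_length=0` it mirrors the `awk 'NR%4==2' | sort | uniq -c` pipeline above, apart from the output format.

```python
import gzip
from collections import Counter


def collapse_reads(fastq_path, min_length=0):
    """Count identical read sequences in a FASTQ or FASTQ.gz file.

    Reads shorter than min_length are skipped (hypothetical option).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # sequence line, same as awk 'NR%4==2'
                seq = line.strip()
                if len(seq) >= min_length:
                    counts[seq] += 1
    return counts


def write_counts(counts, out_path):
    """Write 'count<TAB>sequence' lines, sorted by sequence like sort | uniq -c."""
    with open(out_path, "w") as out:
        for seq in sorted(counts):
            out.write(f"{counts[seq]}\t{seq}\n")
```

For example, `write_counts(collapse_reads("clean2.SRR458758.fastq.gz"), "uniq.SRR458758.txt")` would produce one line per unique sequence. Note that the tab-separated layout differs from the space-padded output of `uniq -c`.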

##### Alignment

In [None]:
# 1. Building an Index
# reference transcript file source : https://www.ncbi.nlm.nih.gov/genome/?term=txid10090[orgn]

!kallisto index -i transcripts.idx GCF_000001635.27_GRCm39_rna.fna.gz

In [None]:
# 2. Pseudoalignment (Quantification)
# annotation file & chromosome info source : https://www.ncbi.nlm.nih.gov/genome/?term=txid10090[orgn]

!kallisto quant --single -i transcripts.idx -o kali_out --genomebam \
    --gtf GCF_000001635.27_GRCm39_genomic.gff.gz --chromosomes chr.txt \
        --fragment-length=53 --sd=18 \
            clean2.SRR458758.fastq.gz
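
Only the first sample is quantified above. The sketch below builds the same `kallisto quant` call for any sample, given its read-length mean and standard deviation. `kallisto_quant_command` is a hypothetical helper, and the per-sample output directory name `kali_out_<sample>` is an assumption (the command above writes to `kali_out`). The statistics for the other two samples are not recorded here, so they would have to be taken from the corresponding `readlength.sh` runs.

```python
import shlex


def kallisto_quant_command(sample, mean_length, sd_length, index="transcripts.idx"):
    """Build the kallisto quant argument list for one single-end sample."""
    return [
        "kallisto", "quant", "--single",
        "-i", index,
        "-o", f"kali_out_{sample}",  # assumption: one output directory per sample
        "--genomebam",
        "--gtf", "GCF_000001635.27_GRCm39_genomic.gff.gz",
        "--chromosomes", "chr.txt",
        # Rounded to whole numbers, as in the command above (53.1 -> 53, 17.9 -> 18).
        f"--fragment-length={round(mean_length)}",
        f"--sd={round(sd_length)}",
        f"clean2.{sample}.fastq.gz",
    ]


# Values for SRR458758 come from the readlength.sh comment earlier in the notebook.
print(shlex.join(kallisto_quant_command("SRR458758", 53.1, 17.9)))
```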

##### Identify reads mapped to reference transcripts

## Prepare Negative Samples

### Generating Random Samples