# Mapping reads after quality control

## 1. Mapping to human reference genome
<br>
Download the human reference genome GRCh37.p5 (https://www.ncbi.nlm.nih.gov/grc/human/data?asm=GRCh37.p5), merge the chromosome sequences into a single FASTA file, and save it as hs_ref_GRCh37.p5.fa
<br>
<br>
<strong>Index the human reference genome:</strong>

In [None]:
%%bash
bowtie2-build hs_ref_GRCh37.p5.fa hsGRCh37

bowtie2-build creates six index files:<br>
hsGRCh37.1.bt2<br>
hsGRCh37.2.bt2<br>
hsGRCh37.3.bt2<br>
hsGRCh37.4.bt2<br>
hsGRCh37.rev.1.bt2<br>
hsGRCh37.rev.2.bt2<br>
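
Before mapping, it can help to confirm that all six index files are in place. The following is a minimal sketch, assuming the index was built in the current directory with the basename hsGRCh37; the helper names `expected_index_files` and `missing_index_files` are hypothetical and not part of Bowtie 2.

```python
from pathlib import Path

# Suffixes of the six files listed above for a small Bowtie 2 index.
INDEX_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def expected_index_files(basename):
    """Return the six file names expected for the given index basename."""
    return [basename + suffix for suffix in INDEX_SUFFIXES]


def missing_index_files(basename, directory="."):
    """Return the expected index files that are not present in `directory`."""
    return [name for name in expected_index_files(basename)
            if not (Path(directory) / name).exists()]


if __name__ == "__main__":
    missing = missing_index_files("hsGRCh37")
    if missing:
        print("Missing index files:", ", ".join(missing))
    else:
        print("All six index files are present.")
```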

<strong>Map the reads and sort the .bam file</strong>

In [None]:
%%bash
bowtie2 --local --no-contain -x hsGRCh37 -1 sample_1_QC2.fq -2 sample_2_QC2.fq -S sample_hs.sam
samtools view -bS -f 2 sample_hs.sam  | samtools sort - -T sample_hs.sorted.nnn -o sample_hs.sorted.bam
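
The `-f 2` option in the command above keeps only reads whose FLAG has bit 0x2 set, meaning the aligner marked both mates as a proper pair. The following is a minimal Python sketch of that bit test; the function name `passes_required_flags` is hypothetical, and the flag values are illustrative examples rather than values taken from this data set.

```python
# Bit values of the SAM FLAG field used in this example.
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # both mates aligned as a proper pair
MATE_UNMAPPED = 0x8   # the mate did not align


def passes_required_flags(flag, required=PROPER_PAIR):
    """Mimic `samtools view -f`: keep a read only if every required bit is set."""
    return (flag & required) == required


if __name__ == "__main__":
    # 99 = 1 + 2 + 32 + 64: paired, proper pair, mate on reverse strand, first in pair.
    print(99, passes_required_flags(99))
    # 73 = 1 + 8 + 64: paired, mate unmapped, first in pair (not a proper pair).
    print(73, passes_required_flags(73))
```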

The output is sample_hs.sorted.bam.
<br>
Next, keep only the mapped paired-end reads with a mapping quality score of at least 25 (samtools view -q 25 keeps reads with MAPQ >= 25):

In [None]:
%%bash
samtools view -h -q 25 -b sample_hs.sorted.bam -o sample_hs.sorted.mapQ25.bam
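
To give a feel for the threshold: the SAM specification defines MAPQ as a Phred-scaled value, -10 log10 of the probability that the mapping position is wrong, so a nominal MAPQ of 25 corresponds to an error probability of roughly 0.003. The sketch below, with hypothetical function names, converts a MAPQ value and applies the same cut-off to a tab-separated SAM line; how closely a given aligner's MAPQ follows this definition is not addressed here.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled MAPQ into the nominal probability of a wrong position."""
    return 10 ** (-mapq / 10)


def keep_by_mapq(sam_line, min_mapq=25):
    """Mimic `samtools view -q`: keep an alignment line if MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq   # column 5 of a SAM record is MAPQ


if __name__ == "__main__":
    print(round(mapq_to_error_probability(25), 4))
    # A made-up alignment record with MAPQ 25, for illustration only.
    example = "read1\t99\tchr1\t100\t25\t50M\t=\t200\t150\t*\t*"
    print(keep_by_mapq(example))
```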

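The length distribution of the mapped fragments was examined with a cumulative distribution plot. First, extract the positive template lengths (column 9 of the SAM record), so that each read pair contributes at most one value: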

In [None]:
%%bash
samtools view sample_hs.sorted.mapQ25.bam | awk '$9 > 0 {print $9}' - > sample_hs_fragment_length

In [16]:
%load_ext rpy2.ipython


In [24]:
%%R
library(ggplot2)
library(reshape2)
library(dplyr)
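CDF_plot <- function(filename){
  # Write the plot to <filename>.tiff at 600 dpi (bitmap() requires Ghostscript);
  # the type is set explicitly because bitmap() defaults to PNG output
  bitmap(paste0(filename,".tiff"), type = "tiff24nc", res = 600)
  # The input has one fragment length per line and no header row
  data <- read.table(filename, header = FALSE)
  colnames(data) <- c("fragment_length")
  samples <- rep(filename,nrow(data))
  data <- cbind(samples,data)
  # Restrict the plot to fragments of 90 to 220 bp
  new_data <- data %>% filter(fragment_length %in% (90:220))
  # Refer to the column by name inside aes() rather than with new_data$...
  img <- ggplot(new_data, aes(x = fragment_length)) +
    stat_ecdf() +
    theme_bw() +
    xlab("mapped fragment length") +
    scale_x_continuous(breaks=c(0,90,100,110,120,130,140,150,160,170,180,190,200,210,220))
  print(img)
  invisible(dev.off())
}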

In [25]:
%%R
CDF_plot("sample_hs_fragment_length")
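
The curve drawn by stat_ecdf() is the empirical cumulative distribution: for each length x, the share of fragments in the 90 to 220 bp window that are no longer than x. A minimal Python sketch of the same calculation follows; the function name `ecdf` and the example lengths are made up for illustration and are not results from this analysis.

```python
from bisect import bisect_right


def ecdf(values, lo=90, hi=220):
    """Return (sorted lengths within [lo, hi], F) where F(x) is the share of lengths <= x."""
    kept = sorted(v for v in values if lo <= v <= hi)
    if not kept:
        raise ValueError("no fragment lengths inside the requested window")

    def cumulative_fraction(x):
        # bisect_right counts how many kept values are <= x
        return bisect_right(kept, x) / len(kept)

    return kept, cumulative_fraction


if __name__ == "__main__":
    lengths = [85, 120, 147, 150, 150, 166, 175, 230]  # made-up fragment lengths
    kept, F = ecdf(lengths)
    # Share of windowed fragments that are 145 to 175 bp long
    print(F(175) - F(144))
```

Reading the curve at two lengths and taking the difference, as in the last line, gives the share of fragments that fall between them.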

Extract the reads whose fragments are 145 to 175 bp long. This range matches the length of nucleosome positioning sequences and also accounts for most of the mapped fragments. The output is a .bam file, which is convenient for downstream analysis.

In [None]:
%%bash
# The awk filter drops the SAM header lines, so the @SQ header lines
# (expected in the file SQ_tag) are prepended before converting back to .bam
samtools view -h sample_hs.sorted.mapQ25.bam | awk 'BEGIN{FS="\t"}{if(145<=$9 && $9<=175 || -175<=$9 && $9<=-145) print $0}' | cat SQ_tag - | samtools view -bS - -o sample_hs.sorted.mapQ25.ex.bam
bamToBed -i sample_hs.sorted.mapQ25.ex.bam > sample_hs.sorted.mapQ25.ex.bamtobed.bed
sort -k4 sample_hs.sorted.mapQ25.ex.bamtobed.bed > sample_hs.sorted.mapQ25.ex.sorted.bed
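
The length filter in the first command above tests the template length in column 9 of each SAM record, which is positive for one mate of a pair and negative for the other. A minimal Python sketch of an equivalent filter is shown below; unlike the awk command it passes header lines through unchanged, which is a design choice of this sketch and not the method used in the cell above. The function name and the example records are hypothetical.

```python
def filter_by_fragment_length(sam_lines, min_len=145, max_len=175):
    """Yield header lines unchanged and alignments with min_len <= |TLEN| <= max_len."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line                       # keep @HD, @SQ and @PG header lines
            continue
        tlen = int(line.split("\t")[8])      # column 9 of a SAM record is TLEN
        if min_len <= abs(tlen) <= max_len:
            yield line


if __name__ == "__main__":
    # Made-up SAM text for illustration only.
    example = [
        "@SQ\tSN:chr1\tLN:1000\n",
        "r1\t99\tchr1\t100\t30\t50M\t=\t197\t147\t*\t*\n",
        "r1\t147\tchr1\t197\t30\t50M\t=\t100\t-147\t*\t*\n",
        "r2\t99\tchr1\t300\t30\t50M\t=\t450\t200\t*\t*\n",
    ]
    for kept_line in filter_by_fragment_length(example):
        print(kept_line, end="")
```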

In [None]:
%%bash
chmod a+x reformat.py
./reformat.py -i sample_hs.sorted.mapQ25.ex.sorted.bed

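The output file is sample_hs.sorted.mapQ25.ex.bed.
<br>
Create a .2bit file from the human FASTA (see https://genome.ucsc.edu/goldenpath/help/twoBit.html), use it to extract the sequences of the selected fragments, and convert them to FASTQ: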

In [None]:
%%bash
faToTwoBit hs_ref_GRCh37.p5.fa hs_ref_GRCh37.p5.2bit
twoBitToFa hs_ref_GRCh37.p5.2bit -bed=sample_hs.sorted.mapQ25.ex.bed sample_hs.mapQ25.ex.fa
perl fasta_to_fastq.pl sample_hs.mapQ25.ex.fa > sample_hs.mapQ25.ex.fq
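
The script fasta_to_fastq.pl is not shown here, so its exact behavior is unknown. As a rough illustration of what a FASTA to FASTQ conversion involves, the sketch below writes each sequence as a four-line FASTQ record. Because sequences extracted from a reference have no base qualities, it assigns one constant placeholder quality character to every base; the character "I" and the function name `fasta_to_fastq` are assumptions of this sketch, not details of the Perl script.

```python
def fasta_to_fastq(fasta_lines, quality_char="I"):
    """Yield FASTQ records from FASTA lines, using one constant placeholder quality."""
    name, chunks = None, []

    def emit():
        # Join wrapped sequence lines and pad the quality string to the same length
        seq = "".join(chunks)
        return f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"

    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield emit()              # finish the previous record
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield emit()                      # finish the last record


if __name__ == "__main__":
    # Made-up FASTA text for illustration only.
    example = [">frag1", "ACGT", "AC", ">frag2", "GG"]
    for record in fasta_to_fastq(example):
        print(record, end="")
```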