# Intro to HTS data
In this exercise we will cover the following:

 - File formats (FASTQ, SAM/BAM, VCF)
 - Mapping (single-end, paired-end) NGS data to a reference sequence
 - Read flags
 - VERY IMPORTANT: you need to identify the 'pipe' key on your keyboard, '|'. This is the character that looks like a vertical bar. On a standard American keyboard it can be found by pressing shift + 'the key to the left of Enter' or 'the key above Enter'.

## Environment setup

In [None]:
# shared tools and data folder
TOOL_PATH=/home/users/shixu/Software # for standalone tool script including java package
SHARED_PATH=/home/users/shixu/shared # For reference database
INPUT_PATH=/home/users/shixu/chinacourse2025/shared_data/NGSIntro  # for input data

In [None]:
NA19238_1_FQ=${INPUT_PATH}/fastq/NA19238.YRI.low_coverage.chr21_1.fq.gz
NA19238_2_FQ=${INPUT_PATH}/fastq/NA19238.YRI.low_coverage.chr21_2.fq.gz
CHR21=${INPUT_PATH}/human/chr21.fa.gz

# JAVA program
PICARD=${TOOL_PATH}/picard.jar

echo --programs that are installed:--
which samtools
which bwa
which angsd
which bcftools
ls $PICARD


echo --Datasets that will be used--
echo pair of fastQ files
ls $NA19238_1_FQ
ls $NA19238_2_FQ

echo reference genome
ls $CHR21


First make a folder for the exercise and add symbolic links to the reference genome and the fastQ files.

In [None]:
#make folder and enter it
mkdir -p ~/day1_ngsintro
cd ~/day1_ngsintro

#make links to files and add them to the folder
cp -sf  $NA19238_1_FQ .
cp -sf  $NA19238_2_FQ .
cp -sf  ${CHR21}* .

echo --- files in folder ---
ls

In [None]:
# set up R working space
work_d <- path.expand("~/day1_ngsintro")
setwd(work_d)

In [None]:
# set up python working space
import os
work_d = os.path.expanduser("~/day1_ngsintro")
os.chdir(work_d)

## Mapping one reduced genome
In this exercise you will align a fastq file using bwa and generate a SAM file.

To limit computation time, we have created a reduced genome from one of the individuals from the 1000 Genomes pilot project. The individual, NA19238, has been sequenced using Illumina. For this exercise we have created a reduced reference genome consisting only of chromosome 21 (the smallest human chromosome) and also reduced the sequencing data to reads that will likely map to chromosome 21, within the first 15Mb of the chromosome.

The fastQ file NA19238.YRI.low_coverage.chr21_1.fq.gz has a name ending in *_1.fq.gz, which means it holds the first read of each read pair. The *_2.fq.gz file holds the second read of each pair.

We first want to map both reads of each pair to the reference genome chr21.fa.gz.
 
 
 
### Viewing the input files

View the fastq file (NA19238......1.fq.gz) using the head command and identify the reads and quality scores (ignore the length).


In [None]:
# -n determines the number of lines printed
gunzip -c NA19238.YRI.low_coverage.chr21_2.fq.gz | head -n 2000


 - Identify the read names, the sequence and the base quality scores. Fill in the ???? below.
<code>
?????        @SRR794309.186 
?????        GTTGGCGTGGGTGCAGTGATGAGGGAACACTTCTACACTGCTGGTGGGATTGTAAGCTAGTATAGCCACCACAGAAAACAGTGTGGAGATTTCTTAAAGA
        +
?????        CCCFFFFDHHHFHIIJEHIJJHIJJIJIJJJJJJJJJJJJJJJJFHIJGJHIGIJJJJHHEEEHHFFFFFDDDEDDDDDDDCCACDDDDDDDDEDDEDDD
</code>
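
To make the four-line layout of a FASTQ record concrete, here is a minimal Python sketch that splits one record and converts the quality characters to numbers. It assumes the Phred+33 encoding used by recent Illumina data; the function name `parse_fastq_record` and the toy record are made up for illustration and do not come from the course data.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into read name, bases and Phred scores."""
    header, bases, separator, quals = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a 4-line FASTQ record")
    if len(bases) != len(quals):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so the score is the ASCII code minus 33
    scores = [ord(char) - 33 for char in quals]
    return header[1:], bases, scores


# Toy record (invented, not taken from the exercise files)
record = ["@read1", "ACGT", "+", "CFJ#"]
name, bases, scores = parse_fastq_record(record)
print(name, bases, scores)
```

A higher score means a lower estimated error probability for that base, so the last base of the toy read is the least reliable one.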

The command below counts the number of lines in the file.
 - How many lines do you have? ????
 - How many reads are in the data? ????
 - Is the number of lines the same in the 2 fastQ files? ???? (modify the code below to see the number of lines in the other file)

In [None]:
gunzip -c NA19238.YRI.low_coverage.chr21_2.fq.gz |  wc -l

View the reference fasta file (chr21.fa.gz) using the head command. The first many bases of the reference genome are all Ns (unknown bases).
First view the first bases of chr21, then try to view other parts. You can uncomment and modify the commented-out code below.



In [None]:
# first 20 lines
gunzip -c chr21.fa.gz | head -n 20

# last 100 lines of the first million  lines (uncomment and modify below)
# gunzip -c chr21.fa.gz | head -n 1000000 | tail -n 100


## Aligning

Align the reads using bwa. We use bwa in the exercises because it is fast and widely used. We first need to index the reference chromosome, followed by the actual aligning process. It should take around 1 min to finish.


In [None]:
# we will use the prepared index files, so the indexing command is commented out
# bwa index chr21.fa.gz

Once the index is made, the second step is to map the reads. There are several ways to do this, but I suggest you use the bwa mem mode, which is the most commonly used these days. Again you can run it with no arguments to get info about how to use it. 

In [None]:
# see options
bwa mem

The number of options may be a bit overwhelming, but you can run it with no additional options, although I suggest you add "-t 5" to run 5 threads if your computer has multiple cores. It reads the compressed fastq files directly, so you need not decompress them. By default the result comes on stdout (in the terminal), so you have to redirect to a file, like the below command. 
We also want to add a read group name with information about where the reads come from. This is very useful if you have sequencing data from multiple libraries.  
Now try to align the data.


In [None]:
#align the data (takes ~1 min)
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 5 chr21.fa.gz NA19238.YRI.low_coverage.chr21_1.fq.gz NA19238.YRI.low_coverage.chr21_2.fq.gz  > NA19238.sam

Let's look at the generated SAM file.

In [None]:
# view first 100 lines
head -n100 NA19238.sam

You can read about the sam output here: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session5-alignedReads.html  

 - Identify the header and explain its contents. 
 - For the first read identify the following and fill in the (?????) below
     - the chromosome
     - the position of the first base of the read
     - the mapping quality
     - the alignment (CIGAR string)
     - the insert size (template length)
     - the read (the bases)
     - the base qualities
 <code>
 SRR794309.186	(the name of the read)
 99 			(FLAGS)
 chr21			(?????)
 16239093		(?????)
 60				(?????)
 100M			(?????)
 =				(chromosome of the mate; = means the same as the read)
 16239300		(position of the mate)
 307			(?????)
 CCTTTTTATGGCTGAGTAGTATTCCACAGTTTCTTTACCCACTCCTTGATCAATAGGCACTTGGGTTGGTTCCACGATTTTGCATTTGTGAATTGTGTTG		(?????)
 CCCFFFFFHHHHHJJJFHIFHHJIJJJJJHIJJJJJIIJJJJJJJJJIHIIJJJJJJJJJJJJJJFHIJHFHHHHFFFDDEEEEEEDEACCEEECDCCDD		(?????)
 NM:i:0	MD:Z:100	MC:Z:100M	AS:i:100	XS:i:26	RG:Z:foo  (TAGS)
 </code>
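
The CIGAR string describes how the read lines up against the reference as a series of length and operation pairs. The Python sketch below splits a CIGAR string and sums the operations that consume reference bases (M, D, N, = and X in the SAM specification). The function names and the second example string are illustrative assumptions, not output from this exercise.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in REF_CONSUMING)


print(parse_cigar("100M"))
print(reference_span("60M2D38M1I2S"))  # hypothetical read with indels and a soft clip
```

Adding the reference span to the leftmost mapping position gives the position just past the end of the alignment, which is handy when checking the template length by hand.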
 
 
 To understand the flags (second column in the sam format) you can type a flag into this page and get the meaning: https://broadinstitute.github.io/picard/explain-flags.html
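
The flag is a sum of powers of two, where each bit answers one yes/no question about the read. As a rough sketch of what the web page does, the Python snippet below lists the bits that are set in a flag value. The bit meanings follow the SAM specification; the wording of the descriptions and the function name `explain_flag` are my own.

```python
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read reverse strand"),
    (32, "mate reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails quality checks"),
    (1024, "read is PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [description for bit, description in FLAG_BITS if flag & bit]


for description in explain_flag(99):
    print(description)
```

For example, 99 = 64 + 32 + 2 + 1, so four bits are set.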
 


Let's try to find the number of reads in the SAM file.


In [None]:
wc -l NA19238.sam 

Why is it not the same number as in the fastQ file?



Fortunately there are tools to handle SAM files, which will make your life easier. We will use the samtools program. First, you often need the compressed version of the SAM format, which is called BAM. You use samtools view for converting between formats. BAM files facilitate random access to genomic regions, but this requires the file to be sorted and indexed, which is done with the commands below.
Converting SAM to BAM, sorting and indexing is done like this:

In [None]:
#sam to bam
samtools view -b NA19238.sam > NA19238.bam
#sort bam file
samtools sort -o NA19238.sorted.bam NA19238.bam
#index bam file
samtools index NA19238.sorted.bam

#see sizes
echo --- files sizes ---
ls -lah NA19238.sam NA19238.bam NA19238.sorted.bam

The BAM file is a compressed version of the SAM file. You can see that it is about one-third of the size of the SAM file.



We now have a functional alignment file that we can use for analysis. Let's first view the alignment at different parts of chromosome 21. We will use tview to extract the alignment. The options -d T -w 150 print 150 bases of the alignment to the terminal as text.

In [None]:
samtools tview NA19238.sorted.bam  -d T -w 150 -p chr21:10002000

In the output above the lines are:

Line 1: The position on chromosome 21

Line 2: The reference genome (N if not provided)

Line 3: The consensus sequence (if all reads have a G then the consensus is G)

Line 4+: (lines 4, 5, etc.) the read alignments


- When looking at the region starting at position chr21:10002000, can you find a possible variable site?
- Look at chr21:10028350. Do you think there are problems with the alignment at this position?
- Look at chr21:10042151. Is this a variable site or is there another likely explanation?

Let's try to add the reference genome to make it easier to see the sequencing errors and variable sites.

In [None]:
samtools tview NA19238.sorted.bam  -d T -w 150 -p chr21:10042151 chr21.fa.gz

 - How many likely variable sites can you see?
 - Is it possible that both of the first two sites (sites 10042151 and 10042152) are heterozygous sites?
 - Have a look at the region starting at 9719896. Do you think the variable sites in this region are reliable (why/why not)?
 
 
 Another way to look at the genome is by generating output in the [pileup](http://samtools.sourceforge.net/samtools.shtml) format.

In [None]:
# see first 1000 sites where there is data
samtools mpileup NA19238.sorted.bam  | head -n 1000


Each line is a position with data.
 - When is this format particularly useful?
 
 
 From the pileup it is easy to get the sequencing depth distribution.

In [None]:
samtools mpileup NA19238.sorted.bam | cut -f4 | sort -n | uniq -c >dep2
cat dep2


The left column is the number of sites and the right column is the depth.

View the distribution for this individual using the following R command.
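
As a small illustration of how this two-column table can be summarised, the Python sketch below reads `uniq -c` style lines (count first, then depth) and computes the mean depth over the sites that have data. The numbers in `example` are invented for the illustration, and both function names are hypothetical.

```python
def parse_depth_counts(text):
    """Return {depth: number_of_sites} from `uniq -c` style lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip empty or malformed lines
        n_sites, depth = int(fields[0]), int(fields[1])
        counts[depth] = n_sites
    return counts


def mean_depth(counts):
    """Mean depth across the sites present in the table."""
    total_sites = sum(counts.values())
    total_bases = sum(depth * n_sites for depth, n_sites in counts.items())
    return total_bases / total_sites


# Invented toy table: 10 sites at depth 1, 5 at depth 2, 5 at depth 4
example = """
     10 1
      5 2
      5 4
"""
counts = parse_depth_counts(example)
print(mean_depth(counts))
```

To try it on the real table, the contents of the `dep2` file could be read with `open("dep2").read()` and passed to `parse_depth_counts`. Note that sites without any reads are not part of the pileup output, so this mean only describes covered sites.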


In [None]:
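depth <- read.table("dep2", col.names=c("sites","depth"))
d <- 1:15 #chosen depths to plot
# look up each depth by value, so a missing depth row cannot shift the bars
sites <- depth$sites[match(d, depth$depth)]
sites[is.na(sites)] <- 0

barplot(sites,names.arg=d,xlab="sequencing depth",ylab="Number of sites",col="mistyrose")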


 - How do you think the depth will affect genotype and variant calling?
 
 First, let's view the mpileup with the reference.

In [None]:
samtools mpileup -f chr21.fa.gz NA19238.sorted.bam | head -n 1000

 - Can you see the difference compared to not using the reference genome?
 - Can you identify a heterozygous site? (e.g. position 9719896)
 
 
### Variant calling
Let's create a VCF file for the first couple of Mb of chr21. This is done based on the mpileup. There will be much more information tomorrow about how the calling is done using genotype likelihoods. However, before doing so we should remove duplicated reads (reads with the same starting points), as they are likely PCR duplicates.
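
The idea of "reads with the same starting point" can be sketched in a few lines of Python. This is a deliberately simplified illustration and not what samtools or Picard actually do: real tools also take the mate position into account for paired-end data and use more careful criteria for choosing which copy to keep. The read dictionaries and the function name `drop_duplicates` are hypothetical.

```python
def drop_duplicates(reads):
    """Keep the first read seen for each (chromosome, start, strand) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["pos"], read["reverse"])
        if key in seen:
            continue  # same start and strand as an earlier read: treat as duplicate
        seen.add(key)
        kept.append(read)
    return kept


# Invented toy reads: r2 starts at the same place and strand as r1
reads = [
    {"name": "r1", "chrom": "chr21", "pos": 100, "reverse": False},
    {"name": "r2", "chrom": "chr21", "pos": 100, "reverse": False},
    {"name": "r3", "chrom": "chr21", "pos": 100, "reverse": True},
    {"name": "r4", "chrom": "chr21", "pos": 250, "reverse": False},
]
kept = drop_duplicates(reads)
print([read["name"] for read in kept])
```

Removing such copies matters for variant calling because a PCR error that is copied many times would otherwise look like independent evidence for a variant.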

In [None]:
## remove duplicates (paired-end mode; the -s option is meant for single-end reads)
samtools rmdup NA19238.sorted.bam NA19238.md.bam

## call variants
bcftools mpileup -Ou -f chr21.fa.gz NA19238.md.bam | bcftools call -mv -Ov -o NA19238.vcf

Let's have a look at the VCF file.

In [None]:
head -n 200 NA19238.vcf 


 The header of the VCF contains meta information about what is in the file.
In the body of the file:
 - Identify the position, the reference allele and the alternative allele of each site.
 - Identify the depth of each position.
 - Find a tri-allelic site. Do you believe that it is truly tri-allelic?
 - How many sites are called as variable?
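
For readers who prefer to pick the columns apart programmatically, here is a minimal Python sketch that reads one body line of a VCF file. It relies on the fixed column order of the VCF format (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and assumes the depth is stored under the `DP` key in the INFO column. The example line is invented and the function name `parse_vcf_line` is hypothetical.

```python
def parse_vcf_line(line):
    """Extract position, alleles and depth from one tab-separated VCF body line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),  # several alternative alleles are comma-separated
        "depth": int(info_dict["DP"]) if "DP" in info_dict else None,
    }


# Invented example line with two alternative alleles
line = "chr21\t12345\t.\tA\tG,T\t50\t.\tDP=7;AC=1,1"
site = parse_vcf_line(line)
n_alleles = 1 + len(site["alts"])  # reference allele plus alternatives
print(site, n_alleles)
```

A site with two alternative alleles listed has three alleles in total, which is what a tri-allelic site looks like in the ALT column.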
 
 
 
 # Bonus exercise (Only do this part if you have finished the rest) 
 ## Bonus exercise 1 -  duplicated reads using Picardtools
 
 bwa actually fills in the mate information, but not all aligners do that, so we can run picard tools to fill in the mate information and sort the file according to position. We will output the file in the binary version of SAM which is BAM

In [None]:

java -jar $PICARD FixMateInformation INPUT=NA19238.sam \
OUTPUT=id.fixmate.srt.bam SORT_ORDER=coordinate

View the header of the BAM file

In [None]:
samtools view -H id.fixmate.srt.bam 

Picard didn't update the @PG line of the header, so let us update the header information so that we have documented how we modified the file.

In [None]:

(samtools view -H id.fixmate.srt.bam;echo -e "@PG\tID:fixmate\tPN:fixmate\tVN:2.60\tCL:stuff" ) >newhd
samtools reheader newhd id.fixmate.srt.bam > id.fixmate.srt2.bam


 - Validate that the header in file id.fixmate.srt2.bam  has been updated

In [None]:
samtools view -H id.fixmate.srt2.bam 

Now mark duplicates using picard

In [None]:

java -jar $PICARD MarkDuplicates I=id.fixmate.srt2.bam \
O=id.fixmate.srt.md.bam  M=metrics;


 - Did Picard update the @PG lines of the header?
 - Did Picard update anything else in the header?

NB you can view the header of a bamfile using 'samtools view -H'




In [None]:
samtools view -H id.fixmate.srt.md.bam



## Bonus exercise 2 - clean your BAM files using the FLAGS column

The second column in the SAM format is the very important FLAG. This will tell you about the state of the paired-end mapping, QC, duplicates etc.


  
Using the samtools view options -F/-f you can discard/include reads whose flags fulfill certain patterns. See http://broadinstitute.github.io/picard/explain-flags.html .
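
The logic behind the two options can be written as two bitwise tests: `-f` keeps a read only if all the requested bits are set, and `-F` drops a read if any of the listed bits is set. The Python sketch below mimics this for a single flag value; the function name `passes_filter` and its parameter names are my own and only illustrate the rule.

```python
def passes_filter(flag, require=0, exclude=0):
    """Mimic `samtools view -f require -F exclude` for one FLAG value."""
    has_all_required = (flag & require) == require
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded


# 99 = paired, proper pair, mate reverse, first in pair
print(passes_filter(99, require=2, exclude=1024))         # proper pair, not a duplicate
print(passes_filter(99 + 1024, require=2, exclude=1024))  # same read, marked as duplicate
print(passes_filter(77, require=4))                       # 77 includes the unmapped bit (4)
```

Several bits can be combined by adding their values, for example excluding both unmapped reads and duplicates with 4 + 1024 = 1028.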

  1. How many reads have we marked as duplicates in the final file?
  2. How many properly mapped read pairs do we have? (Where both reads map to the same chr etc.)
  3. How many mapped reads do we have?
  4. How many unmapped reads do we have?
  5. Find the distribution of the RNAMEs of the unmapped reads.

 Run the following commands one at a time by uncommenting them.

In [None]:



#samtools view -f 1024 id.fixmate.srt.md.bam|wc -l
#samtools view -f 2 id.fixmate.srt.md.bam|wc -l
#samtools view -F 4 id.fixmate.srt.md.bam|wc -l
#samtools view -f 4 id.fixmate.srt.md.bam|wc -l
# samtools view -f 4 id.fixmate.srt.md.bam|cut -f3|sort -n |uniq -c




Compare with "samtools flagstat" command 


In [None]:
samtools flagstat id.fixmate.srt.md.bam


Make a new BAM file that keeps only the reads where both ends map, filters out those with a mapping quality below 10, and removes duplicates.


In [None]:
# -b writes BAM output; without it samtools view writes SAM text
samtools view -b -f 2 -F 1024 -q 10 id.fixmate.srt.md.bam > new.bam