# Intro to HTS data
In this exercise we will cover the following:

 - File formats (FASTQ, SAM/BAM, VCF)
 - Mapping (single-end, paired-end) NGS data to a reference sequence
 - Read flags
 - VERY IMPORTANT: you need to find the 'pipe' key, '|', on your keyboard. It is the character that looks like a vertical bar. On a standard American keyboard it is typed by pressing shift + the key to the left of enter or the key above enter.
 

In this exercise you will align a fastq file using bwa and generate a SAM file.

To keep the computational time down, we have created a reduced data set from one of the individuals in the wildebeest project. The individual, CTauTzS_8872, has been sequenced using short-read sequencing. For this exercise we have also created a reduced reference genome.  

 The fastQ file CCTauTzS_8872.Goat.small.fq_1.gz has a name ending in fq_1.gz, which marks it as the first read of each read pair. The second reads are in the matching fq_2.gz file.  
 
 
 ### Running jupyter
 Use Ctrl+ENTER to run the code in a cell. 
 

In [None]:
FASTQ_1=/davidData/data/course/kenyaWorkshop/anders/NGSintro_day1/CCTauTzS_8872.Goat.small.fq_1.gz
FASTQ_2=/davidData/data/course/kenyaWorkshop/anders/NGSintro_day1/CCTauTzS_8872.Goat.small.fq_2.gz
GOAT_REF=/davidData/data/course/kenyaWorkshop/anders/NGSintro_day1/goat.fa.gz

# JAVA program
PICARD=/course/popgen23/anders/ngsIntro/picard.jar
echo --programs that are installed:--
which samtools
which bwa
which angsd
which bcftools
ls $PICARD

echo; echo -Datasets that will be used-
echo ---pair of fastQ files
ls $FASTQ_1
ls $FASTQ_2
echo ---reference genome with index
ls $GOAT_REF*




First make a folder for the exercise and add symbolic links to the reference genome and the fastQ files

In [None]:
#make folder 
mkdir -p ~/kenya2024/
mkdir -p ~/kenya2024/NGSintro

# enter folder
cd ~/kenya2024/NGSintro

##make links to files and add them to the folder
# links to the two fastQ files
cp -sf  $FASTQ_1 .
cp -sf  $FASTQ_2 .
# link to reference genome
cp -sf  $GOAT_REF* .


echo --- files in folder ---
ls 





Before we start mapping we want to perform some QC of the data. 
 
# step 1: FastQ file and QC
### Viewing the input files (fastQ file)


View the fastq file (CCTauTzS_8872.Goat.small.fq_1.gz) using the head command and identify the reads and quality scores (ignore the Broken pipe warning)


In [None]:
# -n determines the number of lines printed
gunzip -c CCTauTzS_8872.Goat.small.fq_1.gz | head -n 12


### run code below to start quiz

In [1]:
# run to start quiz       
from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day1_NGSintro/quiz1.json')



The command below counts the number of lines in the file


In [None]:
gunzip -c CCTauTzS_8872.Goat.small.fq_1.gz |  wc -l
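
The line count can be turned into a read count because a FASTQ record normally takes four lines. The Python sketch below does the same calculation as `wc -l` followed by a division by four. It assumes unwrapped sequences (exactly four lines per read) and uses a tiny made-up file, so the function name and demo data are illustrative only.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per read."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Demo on a small made-up file with three identical records
record = "@read1\nACGT\n+\nFFFF\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(record * 3)
    n_reads_demo = count_fastq_reads(demo_path)

print(n_reads_demo)
```

To use it on the course data you would pass the path of the linked fastQ file instead of the demo file.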

In [None]:
# run to start quiz

from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day1_NGSintro/quiz2.json')


#### reference fasta file

View the reference fasta file (goat.fa.gz) using the head command. You can uncomment and modify the code below to view other parts of the reference



In [None]:
# first 20 lines
zcat goat.fa.gz  | head -n 20

# last 1000 lines of the first million  lines (uncomment and modify below)
# gunzip -c goat.fa.gz 2>/dev/null | head -n 1000000 | tail -n 1000

#### fastqc

Let's see if there are any issues with the sequencing reads

In [None]:
 fastqc --nogroup CCTauTzS_8872.Goat.small.fq_1.gz
 
 echo ---- fastQC has created this file ----
 ls *html

To view the report, switch to the main browser tab for the Jupyter notebook. Enter the folder kenya2024/NGSintro/ and find the html file. Click the file to open the fastQC report


![fastQC file](https://github.com/popgenDK/courses/blob/main/kenya2024/exercises/day1_NGSintro/fastQCfile.png?raw=true)

In [2]:
# run to start quiz       
from jupyterquiz import display_quiz
display_quiz('https://raw.githubusercontent.com/popgenDK/courses/main/kenya2024/exercises/day1_NGSintro/quiz3.json')



# mapping / Aligning

Align the reads using bwa. We use bwa in the exercises because it is fast and widely used. Mapping takes two steps: first the reference is indexed, then the reads are aligned. The index for the reduced reference is already provided (it was linked into the folder together with goat.fa.gz), so only the alignment remains. It should take around 1 min to finish. 


With the index in place, the second step is to map the reads. There are several ways to do this, but I suggest you use the bwa mem mode, which is the most commonly used these days. You can run it with no arguments to get info about how to use it. 

In [None]:
# see options
bwa mem

The number of options may be a bit overwhelming, but you can run it with no additional options, although I suggest you add "-t 5" to run 5 threads if your computer has multiple cores. It reads the compressed fastq files directly, so you need not decompress them. By default the result is written to stdout (the terminal), so you have to redirect it to a file, as in the command below. 
We also want to add a read group name with information about where the reads come from. This is very useful if you have sequencing data from multiple libraries.  
Now try to align the data


In [None]:
# bwa command 
# bwa mem -R readGroupName -t threads REF fastq_1 fastq_2

# align the data (takes ~1 min)
bwa mem -R '@RG\tID:foo\tSM:bar\tLB:library1' -t 5 goat.fa.gz CCTauTzS_8872.Goat.small.fq_1.gz CCTauTzS_8872.Goat.small.fq_2.gz > CTauTzS_8872.sam
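
The read group string given to `-R` uses the two characters backslash and t as separators, which bwa converts to real tabs in the SAM header. The command above uses the placeholder values foo and bar. The Python sketch below shows one way such a string could be assembled from named fields; the helper name is hypothetical and not part of bwa.

```python
def read_group_argument(rg_id, sample, library):
    """Build the string passed to `bwa mem -R`.

    The fields are joined with a literal backslash + 't' (two characters),
    which bwa turns into tab characters when it writes the @RG header line.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}"]
    return "\\t".join(fields)


# Same placeholder values as in the bwa command above
print(read_group_argument("foo", "bar", "library1"))
```

In a real project the sample field would typically hold the individual's name so that downstream tools can tell samples apart.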

Wait until it is done. While the cell is still running there is no output and you will see [*] next to it.

Let's look at the generated sam file (ignore the warnings)

In [None]:


# view first 1 line of the sam file
samtools view CTauTzS_8872.sam | head -n 1


You can read about the sam output here: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session5-alignedReads.html  

 - Identify the header and explain its contents (you can print it with 'samtools view -H CTauTzS_8872.sam'). 
 - For the first read, identify the following and fill in the (?????) below:
     - the chromosome
     - the position of the first base of the read
     - the mapping quality
     - the alignment (CIGAR string)
     - the insert size (template length)
     - the read (the bases)
     - the base qualities



 <code>
FP200000259BRL1C001R0010206629	(the name of the read)
99                  			(FLAGS)
NC_030808.1		            	(?????)
760750	                		(?????)
60              				(?????)
150M			                (?????)
=	                			(name of the mate is the same)
760946                  		(position of the mate)
346                 			(?????)
CATACATACACAAGCATACTACACCT....	(?????)
FDFBFDFEFEFCEDEFFEEGEFEEEE...	(?????)
 NM:i:2	MD:Z:21G89G38 ....      (TAGS)
 </code>
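
A CIGAR string such as 150M describes how the read lines up with the reference as a series of length and operation pairs. The Python sketch below parses a CIGAR string and works out how many reference bases and read bases it covers, following the operation table in the SAM specification. The start position and the longer CIGAR strings are made-up examples.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")   # operations that advance along the reference
CONSUMES_READ = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Sanity check: the parsed pieces must rebuild the original string
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


start = 1000  # hypothetical 1-based start position
for cigar in ["150M", "100M2I48M", "50M3D100M", "5S145M"]:
    end = start + reference_span(cigar) - 1
    print(cigar, "read length:", read_length(cigar), "covers:", start, "to", end)
```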
 


 
 To understand the flags (second column in the sam format) you can type a flag into this page and get the meaning: https://broadinstitute.github.io/picard/explain-flags.html
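
The flag is a sum of powers of two, where each bit answers one yes/no question about the read. As a small sketch of what the web page does, the Python snippet below splits a flag into its bits using the bit meanings from the SAM specification (the wording of the descriptions is paraphrased).

```python
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read on reverse strand"),
    (32, "mate on reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails quality checks"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for flag in (99, 147):
    print(flag, explain_flag(flag))
```

For example, 99 = 1 + 2 + 32 + 64, so four bits are set.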
 


Let's try to find the number of reads in the sam file.

In [None]:
wc -l CTauTzS_8872.sam

- Why is it not the same number as in the fastQ file?



Fortunately there are tools to handle sam files, which will make your life easier. We will use the samtools program. First, you often need the compressed version of the sam format, which is called bam. You use samtools view to convert between the formats. BAM files facilitate random access to genomic regions, but this requires the file to be sorted by position and indexed.
Converting sam to bam, then sorting and indexing the bam file, is done like this:

In [None]:
#sam to bam
samtools view -b CTauTzS_8872.sam > CTauTzS_8872.bam
#sort bam file
samtools sort -o CTauTzS_8872.sorted.bam CTauTzS_8872.bam
#index bam file
samtools index CTauTzS_8872.sorted.bam

#see sizes
echo --- files sizes ---
ls -lah CTauTzS_8872.sam CTauTzS_8872.bam CTauTzS_8872.sorted.bam

The bam file is a compressed version of the sam file; you can see that it is about one-third of the size. 



We now have a functional alignment file that we can use for analysis. Let's first view the alignment at different parts of the chromosome NC_030808.1. We will use tview to extract the alignment. The options -d T and -w 100 print 100 bases of the alignment as plain text to the terminal, and -p sets the start position

In [None]:
samtools tview  CTauTzS_8872.sorted.bam  -d T -w 100 -p NC_030808.1:130171


In the output above the lines are

Line 1: The position on chromosome NC_030808.1

Line 2: The reference genome (N if not provided)

Line 3: The consensus sequence (if most or all reads have a G then the consensus is G)

Line 4+: (lines 4, 5, etc.) the aligned reads 


- When looking at the region starting with position NC_030808.1:130171 can you find a possible variable site?


Let's try to add the reference genome to make it easier to see the sequencing errors and variable sites

In [None]:
samtools tview CTauTzS_8872.sorted.bam  -d T -w 100 -p NC_030808.1:130161 goat.fa.gz

 - Can you find the site that is likely heterozygous?
 
 Some parts of the genome are hard to map to. Let's try another position
 - Change the position to NC_030808.1:156221 (modify the code above and run it).
 - How many likely variable sites can you see?
 - Are these variable sites or is there another likely explanation?

Next, let's look at the distribution of sequencing depth. The command below counts how many sites have each depth.


In [None]:
samtools mpileup CTauTzS_8872.sorted.bam  | cut -f4 | sort -n | uniq -c >dep1

cat dep1


The left column is the number of sites and the right column is the depth. 

View the distribution for this individual using the following R command


In [None]:

depth <- read.table("~/kenya2024/NGSintro/dep1")
d <- 1:15 #chosen depths to plot

barplot(depth[d+1,1],names=d,xlab="sequencing depth",ylab="Number of sites with sequencing depth ",col="mistyrose")
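
Besides plotting, the two-column depth table can be summarised as a single mean depth. The Python sketch below reads lines in the same "number of sites, depth" layout as dep1 and computes the average over the sites that are listed; positions that mpileup did not report are not part of this average. The numbers are made up for illustration.

```python
def read_depth_table(lines):
    """Parse `uniq -c` style lines ("<number of sites> <depth>") into a dict."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip empty or malformed lines
        n_sites, depth = int(fields[0]), int(fields[1])
        table[depth] = n_sites
    return table


def mean_depth(table):
    """Average depth over all sites listed in the table."""
    total_sites = sum(table.values())
    total_bases = sum(depth * n_sites for depth, n_sites in table.items())
    return total_bases / total_sites


# Hypothetical table: 10 sites with depth 0, 40 with depth 1, and so on
example_lines = ["     10 0", "     40 1", "     30 2", "     20 3"]
table = read_depth_table(example_lines)
print(f"mean depth: {mean_depth(table):.2f}")
```

On the real data the lines could be read with `open("dep1")` in place of the example list.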


 - How do you think the depth will affect genotype and variant calling?
 



 
# Variant calling 


### create VCF file
Let's create a VCF file for the first Mb (positions 1-1000000 of NC_030808.1) of CTauTzS_8872.sorted.bam. This is done using bcftools. The ploidy is set to diploid (2), as for mammals, but otherwise we use the default settings

In [None]:
## remove duplicates
#samtools rmdup -s CTauTzS_8872.sorted.bam CTauTzS_8872.md.bam

## call variants
bcftools mpileup -Ou -f goat.fa.gz CTauTzS_8872.sorted.bam -r NC_030808.1:1-1000000 | bcftools call --ploidy 2 -mv -a GQ  -Ov -o CTauTzS_8872.vcf

Let's have a look at the VCF file



In [None]:
head -n 100 CTauTzS_8872.vcf 


 The header of the VCF contains meta information about what is in the file.
In the body of the file
 - Identify the position, the reference allele and the alternative allele of each site.
 - Identify the depth of each position
 - Identify the genotype quality for each genotype call
 
 There are many sites with too little information to call variants. Let's apply some light filters. Here we remove sites with fewer than 8 reads and sites with a quality score (QUAL) below 20


In [None]:
bcftools filter CTauTzS_8872.vcf -e 'QUAL<20 || DP < 8' > CTauTzS_8872.filt.vcf

head -n 100 CTauTzS_8872.filt.vcf 
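
To make the filter expression concrete, the Python sketch below applies the same two conditions (QUAL below 20 or DP below 8) to individual VCF lines and also checks whether the genotype is heterozygous. The two VCF lines are made up for illustration, and the sketch assumes QUAL is numeric and that DP is present in the INFO column.

```python
def parse_info(info):
    """Turn the INFO column ("DP=12;MQ=60") into a dict of strings."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def passes_filter(vcf_line, min_qual=20.0, min_depth=8):
    """Mirror of the exclusion rule 'QUAL<20 || DP<8': keep the rest."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    depth = int(parse_info(fields[7])["DP"])
    return not (qual < min_qual or depth < min_depth)


def is_heterozygous(vcf_line):
    """True if the first sample's GT has two different alleles (e.g. 0/1)."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    genotype = sample_values[format_keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")
    return len(set(alleles)) > 1


# Hypothetical VCF body lines (tab separated)
line_kept = "NC_030808.1\t1234\t.\tA\tG\t45.4\t.\tDP=12;MQ=60\tGT:PL:GQ\t0/1:78,0,95:80"
line_removed = "NC_030808.1\t5678\t.\tC\tT\t8.1\t.\tDP=3;MQ=60\tGT:PL:GQ\t1/1:36,9,0:12"

for line in (line_kept, line_removed):
    print(line.split("\t")[1], passes_filter(line), is_heterozygous(line))
```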


 
 
 - How many sites are called as variable?
 - Find a heterozygous site.  ( look for 0/1)


# Bonus exercise (Only do this part if you have finished the rest) 

 Look at the alignment around the position set in POS below
 


In [None]:
POS=6833

# the position you give is the first base shown, so we center the site by subtracting 50 from the position
POS50=$(($POS  - 50 ))
samtools tview  CTauTzS_8872.sorted.bam  -d T -w 100 -p NC_030808.1:$POS50



We can use the samtools mpileup command to get a summary of the data at that position

In [None]:
  
echo -e "CHR\tPOS\tREF\tDEPTH\tBASES\tbaseQuality" > mpileup.file
samtools mpileup  CTauTzS_8872.sorted.bam -r NC_030808.1:$POS-$POS >> mpileup.file

echo --- pileup of the site which shows the bases and their score ---
column -t mpileup.file
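
The BASES column mixes the observed bases with extra symbols: '^' plus one character marks the start of a read, '$' marks the end of a read, and '+' or '-' followed by a number and a sequence marks an insertion or deletion. The Python sketch below counts the bases while skipping these symbols. The pileup strings are made up, and the function name is illustrative only.

```python
from collections import Counter


def count_bases(pileup, ref_base="N"):
    """Count A/C/G/T/N in a samtools pileup BASES string.

    '.' and ',' mean "same as the reference" and are counted as ref_base.
    """
    counts = Counter()
    i = 0
    while i < len(pileup):
        ch = pileup[i]
        if ch == "^":
            i += 2  # read start: skip '^' and the mapping quality character
        elif ch in "+-":
            i += 1
            digits = ""
            while i < len(pileup) and pileup[i].isdigit():
                digits += pileup[i]
                i += 1
            i += int(digits) if digits else 0  # skip the indel sequence
        elif ch in ".,":
            counts[ref_base.upper()] += 1
            i += 1
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1  # lower case means reverse strand
            i += 1
        else:
            i += 1  # '$', '*', '<', '>' and anything else
    return counts


# Hypothetical pileup strings
print(count_bases("^]A$aAg,.", ref_base="C"))
print(count_bases("A+2TTaG-1cA"))
```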


 - At that position count the number of bases of different types
   - #A = ??
   - #C = ??
   - #G = ??
   - #T = ??

 - Calculate the genotype likelihoods and call the genotype for the site. 
 
You can use the shiny app to help calculate the genotype likelihoods using the GATK model. Enter the BASES and their base qualities

https://popgen.dk/shiny/anders/GL/


The resulting values will not be exactly the same as in bcftools since bcftools uses a slightly different way of calculating the genotype likelihoods than the GATK model. 
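
As a rough sketch of the calculation, the Python snippet below computes log10 genotype likelihoods for the ten diploid genotypes using the simple model usually attributed to GATK: each read base comes from either allele with probability 1/2, and a base with error probability e is seen correctly with probability 1 - e and as each of the other three bases with probability e/3. It assumes Phred+33 qualities and skips bases with quality 0; the shiny app and bcftools may differ in details. The input bases and qualities are made up.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihoods(bases, quals, offset=33):
    """log10 P(data | genotype) for all 10 diploid genotypes.

    Sketch of the simple GATK-style model; assumes Phred+33 qualities.
    """
    result = {}
    for a1, a2 in combinations_with_replacement("ACGT", 2):
        loglik = 0.0
        for base, qchar in zip(bases.upper(), quals):
            q = ord(qchar) - offset
            if q <= 0:
                continue  # assumption: a Q0 base carries no information
            err = 10 ** (-q / 10)
            p1 = 1 - err if base == a1 else err / 3
            p2 = 1 - err if base == a2 else err / 3
            loglik += math.log10(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = loglik
    return result


# Hypothetical site: four A and four G, all with base quality 40 ('I')
gl = genotype_log_likelihoods("AAAAGGGG", "IIIIIIII")
best = max(gl, key=gl.get)
for genotype, value in sorted(gl.items(), key=lambda kv: -kv[1]):
    print(genotype, round(value, 2))
print("called genotype:", best)
```

The genotype with the highest likelihood is the call in this sketch; real callers also use prior information.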


 
 ## Bonus exercise 2 -  duplicated reads using Picardtools
 
 bwa actually fills in the mate information, but not all aligners do that, so we can run picard tools to fill in the mate information and sort the file according to position. We will output the file in the binary version of SAM which is BAM

In [None]:
java -jar /course/popgen23/anders/ngsIntro/picard.jar FixMateInformation INPUT=CTauTzS_8872.sam \
OUTPUT=id.fixmate.srt.bam SORT_ORDER=coordinate

View the header of the BAM file

In [None]:
samtools view -H id.fixmate.srt.bam 

picard didn't update the @PG line of the header, so let us update the header information so that we have documented how we modified the file.

In [None]:
(samtools view -H id.fixmate.srt.bam;echo -e "@PG\tID:fixmate\tPN:fixmate\tVN:2.60\tCL:stuff" ) >newhd
samtools reheader newhd id.fixmate.srt.bam > id.fixmate.srt2.bam



 - Validate that the header in file id.fixmate.srt2.bam  has been updated

In [None]:
samtools view -H id.fixmate.srt2.bam 

Now mark duplicates using picard

In [None]:
java -jar /course/popgen23/anders/ngsIntro/picard.jar MarkDuplicates I=id.fixmate.srt2.bam \
O=id.fixmate.srt.md.bam  M=metrics;

 - Did picard update the @PG line of the header?
 - Did picard update anything else in the header?

NB you can view the header of a bamfile using 'samtools view -H'




In [None]:
samtools view -H id.fixmate.srt.md.bam



## Bonus exercise 3 - clean your bam files using the FLAGS column

The second column in the SAM format is the very important FLAG. It tells you about the state of the paired-end mapping, QC failures, duplicates etc.


  
Using samtools view -F/-f you can discard/include reads whose flags match certain patterns. See http://broadinstitute.github.io/picard/explain-flags.html .

  1. How many reads have we marked as duplicates in the final file?
  2. How many properly mapped read pairs do we have? (Where both reads map to the same chr etc).
  3. How many mapped reads do we have?
  4. How many unmapped reads do we have?
  5. Find the distribution of the RNAMES of the unmapped reads!?

 Run the following commands one at a time by uncommenting them

In [None]:



#samtools view -f 1024 id.fixmate.srt.md.bam|wc -l
#samtools view -f 2 id.fixmate.srt.md.bam|wc -l
#samtools view -F 4 id.fixmate.srt.md.bam|wc -l
#samtools view -f 4 id.fixmate.srt.md.bam|wc -l
# samtools view -f 4 id.fixmate.srt.md.bam|cut -f3|sort -n |uniq -c




Compare with "samtools flagstat" command 


In [None]:
samtools flagstat id.fixmate.srt.md.bam


Make a new bam file that keeps only the reads where both ends map as a proper pair, while filtering out reads with a mapping quality below 10 and removing duplicates


In [None]:
# -b writes BAM output (without it the output is SAM text)
samtools view -b -f 2 -F 1024 -q 10 id.fixmate.srt.md.bam > new.bam
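
The three filters in the command above combine as follows: -f keeps a read only if all the given bits are set, -F drops it if any of the given bits is set, and -q drops it if the mapping quality is below the threshold. The Python sketch below expresses that logic for a single read; the function name and the example flags are illustrative.

```python
def keep_read(flag, mapq, require=0, exclude=0, min_mapq=0):
    """Mimic `samtools view -f require -F exclude -q min_mapq` for one read."""
    has_all_required = (flag & require) == require
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded and mapq >= min_mapq


# (flag, mapping quality) examples, filtered like -f 2 -F 1024 -q 10
examples = [
    (99, 60),         # proper pair, not duplicate, high quality
    (99 + 1024, 60),  # same but marked as duplicate
    (97, 60),         # paired but not a proper pair
    (99, 5),          # proper pair but low mapping quality
]
for flag, mapq in examples:
    print(flag, mapq, keep_read(flag, mapq, require=2, exclude=1024, min_mapq=10))
```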