# Module 3: NGS and NGS formats
# PART 1: INTRODUCTION TO NEXT GENERATION SEQUENCING (NGS)

This module aims to provide you with an introduction to concepts related to next-generation technologies, sequencing platforms, and a brief introduction to genomics.

Please use this playlist as a starting point for the module: https://www.youtube.com/playlist?list=PLfovZnX0TvKtHq6Q4L5KdW332NCD4GbtU

___

## 01. Introduction to Genomics

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete DNA set, including all its genes and hierarchical, three-dimensional structural configuration.

A significant part of genomics is determining the sequence of molecules that make up an organism's genomic deoxyribonucleic acid (DNA) content.

Genomics can be applied to any organism to study aspects such as functional characteristics, evolution, epidemiology, behaviour, and epigenetics.

Genetic information from viruses, bacteria, and other infectious organisms has long played a crucial role in these efforts. Advances in molecular technologies and bioinformatics have made it possible to examine pathogen genomes in much greater detail. Now, falling costs and turnaround times are bringing high-throughput genetic sequencing within reach for use by clinical and public health investigators.

### What is DNA?

Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and direct the activities of nearly all living organisms. DNA molecules are made of two twisting, paired strands, often referred to as a double helix.

Each DNA strand is made of four chemical units, called nucleotide bases, which make up the genetic "alphabet." The bases are adenine (A), thymine (T), guanine (G), and cytosine (C). Bases on opposite strands pair specifically: an A always pairs with a T; a C always pairs with a G. The order of the As, Ts, Cs, and Gs determines the meaning of the information encoded in that part of the DNA molecule just as the order of letters determines the meaning of a word.
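
The pairing rule can be written down in a few lines of Python. The sketch below is only an illustration: the function name `reverse_complement` is our own, and it assumes a sequence that contains only the four unambiguous bases. Because the two strands of the helix run in opposite directions, the partner strand is obtained by complementing each base and reversing the order.

```python
# Pairing rules described above: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    # Complement every base, walking the input from its last base to its first.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("ATGGAATCC"))  # GGATTCCAT
```

Applying the function twice returns the original sequence, which is a quick way to check that the pairing table is symmetric.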

### What is a Genome?

An organism's complete set of DNA is called its genome. Virtually every single cell in the body contains a complete copy of the approximately 3 billion DNA base pairs, or letters, that make up the human genome.

With its four-letter language, DNA contains the information needed to build the entire human body. A gene traditionally refers to the unit of DNA that carries the instructions for making a specific protein or set of proteins. Each of the estimated 20,000 to 25,000 genes in the human genome codes for an average of three proteins.

Genomics is the study of the molecular organization of genomes, their information content, and the gene products they encode. It is a broad discipline, which may be divided into at least three general areas. 

1. Structural genomics is the study of the physical nature of genomes. Its primary goal is to determine and analyze the DNA sequence of the genome.

2. Functional genomics is concerned with the way in which the genome functions. It examines the transcripts produced by the genome and the array of proteins they encode. 

3. Comparative genomics, in which genomes from different organisms are compared to look for significant differences and similarities.

### What is a Gene?

A gene is a basic unit of heredity and a sequence of nucleotides in DNA that encodes the synthesis of a gene product, either RNA or protein. Proteins make up body structures like organs and tissue, as well as control chemical reactions and carry signals between cells. If a cell's DNA is mutated, an abnormal protein may be produced, which can disrupt the body's usual processes and lead to a disease such as cancer.

### Microbial Genomics

Microbes are among our planet's most ubiquitous organisms. They are present in every biosphere, including some of the most extreme locations on Earth. Microbes, in general, possess genomes much smaller in size compared to plants and animals, which makes them ideal for genetic and physiological studies.

Microbial genomics is largely the identification and characterization of their genetic compositions. The ability to process and analyze the genomic data collected from microbial organisms is a cornerstone of modern bioinformatics. Its broad applications cover every sector of our lives, such as ensuring the safety of the food supply, maintaining human health and wellness, countering the spread of disease, and protecting the environment.

With bioinformatics tools developed and in place, it is possible to analyze many aspects of microbial genomics. We can identify organisms, assess microbial populations in environmental niches, catalog evolutionary pathways, and define genetic relatedness between microbial strains. Furthermore, research is underway to explore the potential of using genomic traits to ascertain antibiotic resistance and virulence.

### Population Structure, Evolution, And Molecular Epidemiology

Differences in the sequence and structure of genomes from members of a microbial population reflect the composite effects of mutation, recombination, and selection. With the increasing availability of genome sequences, these effects have become better characterized and more effectively exploited so as to understand the history and evolution of microbes and viruses and their sometimes intimate relationships with humans. The resulting insights have practical importance for epidemiologic investigations, forensics, diagnostics, and vaccine development.

The power of full-genome sequencing to discriminate between closely related strains and track the real-time evolution of disease-associated clonal isolates offers the possibility of tracing person-to-person transmission and identifying point sources of outbreaks. 
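
One simple way to quantify how closely related two isolates are is to count the positions at which their aligned genome sequences differ. The sketch below is a toy illustration under strong assumptions: the sequences are already aligned and of equal length, the isolate names and sequences are invented, and real outbreak analyses use dedicated alignment and phylogenetic tools rather than this function.

```python
def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Three invented aligned sequences (hypothetical isolates, not real data).
isolates = {
    "isolate_1": "ATGGAATCCAACACC",
    "isolate_2": "ATGGAATCCAACACC",
    "isolate_3": "ATGGTATCCAGCACC",
}

print(snp_distance(isolates["isolate_1"], isolates["isolate_2"]))  # 0
print(snp_distance(isolates["isolate_1"], isolates["isolate_3"]))  # 2
```

In this toy data set, isolates 1 and 2 are indistinguishable, while isolate 3 differs from them at two positions.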

Genomic approaches have introduced a new era in the discovery and detection of microbial pathogens. The robustness, reliability, and portability of molecular sequence-based data for phylogenetic assessments and for characterization of previously unrecognized pathogens, coupled with technological developments, recommend genomic approaches for both research and routine clinical application.

## 02. Next-Generation Sequencing

Sequencing means determining the exact order of the nucleotides in a given DNA or RNA sequence. Because bases exist as pairs, and the identity of one of the bases in the pair determines the other member of the pair, researchers do not have to report both bases of the pair.

In the most common type of sequencing used today, called sequencing by synthesis, DNA polymerase (the enzyme in cells that synthesizes DNA) is used to generate a new strand of DNA from a strand of interest. In the sequencing reaction, the enzyme incorporates into the new DNA strand individual nucleotides that have been chemically tagged with a fluorescent label. As this happens, the nucleotide is excited by a light source, and a fluorescent signal is emitted and detected. The signal is different depending on which of the four nucleotides was incorporated. This method can generate 'reads' of 125 nucleotides in a row and billions of reads at a time. Researchers can use DNA sequencing to search for genetic variations and/or mutations that may play a role in the development or progression of a disease. The disease-causing change may be as small as the substitution, deletion, or addition of a single base pair or as large as the deletion of thousands of bases.

![NGS](images/ngs_1.png)

*Taken from: https://www.nature.com/articles/nbt1486*

Next-generation sequencing (NGS) is a technology for determining the sequence of DNA or RNA to study genetic variation associated with diseases or other biological phenomena. Introduced for commercial use in 2005, this method was initially called “massively-parallel sequencing”, because it enabled the sequencing of many DNA strands at the same time, instead of one at a time as with traditional Sanger sequencing by capillary electrophoresis.

Next-generation sequencing, also called deep sequencing, is a technology that enables parallel multiplexed analysis of DNA sequences on a massive scale—millions to billions of sequences from individual single strands of DNA analyzed separately, yet simultaneously.

### NGS technologies


- Illumina (Solexa) sequencing: Illumina sequencing is based on a technique known as “bridge amplification” wherein DNA molecules (about 500 bp) with appropriate adapters ligated on each end are used as substrates for repeated amplification synthesis reactions on a solid support (glass slide) that contains oligonucleotide sequences complementary to a ligated adapter. The oligonucleotides on the slide are spaced such that the DNA, which is then subjected to repeated rounds of amplification, creates clonal “clusters” consisting of about 1000 copies of each oligonucleotide fragment. Each glass slide can support millions of parallel cluster reactions. During the synthesis reactions, proprietary modified nucleotides, corresponding to each of the four bases, each with a different fluorescent label, are incorporated and then detected. The nucleotides also act as terminators of synthesis for each reaction, which are unblocked after detection for the next round of synthesis. The reactions are repeated for 300 or more rounds. The use of fluorescent detection increases the speed of detection due to direct imaging, in contrast to camera-based imaging.
- Roche 454 sequencing: This method is based on pyrosequencing, a technique that detects pyrophosphate release, using a light signal (bioluminescence), after nucleotides are incorporated by the polymerase into a new strand of DNA. Roche 454 sequencing platform has been discontinued since 2016.
- Ion Torrent: Proton / PGM sequencing: Ion Torrent sequencing measures the direct release of H+ (protons) from the incorporation of individual bases by DNA polymerase and therefore differs from the previous two methods as it does not measure light.
- PacBio: PacBio sequencing, also referred to as SMRT (Single Molecule Real Time) sequencing, enables very long fragments to be sequenced, up to 30–50 kb, or longer. The SMRT method involves binding an engineered DNA polymerase, with bound DNA to be sequenced, to the bottom of a well (a zero-mode waveguide, ZMW) in a SMRT flow cell.
- Nanopore: Nanopore-based DNA sequencing was first proposed in the late 1990s and commercialization has recently been achieved by Oxford Nanopore Technologies (ONT) with a portable MinION. These sequencers use protein nanopores in an electrically resistant polymer membrane through which characteristic current changes occur as each nucleotide passes through the detector.


### NGS Workflow
![NGS](images/ngs_2.png)

*Taken from: https://www.biorender.com/template/next-generation-sequencing-workflow*

### Advantages of NGS

NGS can be used to analyze DNA and RNA samples and is a popular tool in functional genomics. In contrast to microarray methods, NGS-based approaches have several advantages including:

- A priori knowledge of the genome or genomic features is not required
- It offers single-nucleotide resolution, making it possible to detect related genes (or features), alternatively spliced transcripts, allelic gene variants, and single nucleotide polymorphisms
- Higher dynamic range of signal
- Requires less DNA/RNA as input (nanograms of material are sufficient)
- Higher reproducibility


### Challenges of NGS

There are several limitations to using next-generation sequencing. Next-generation sequencing provides information on a number of molecular aberrations. For many of the identified abnormalities, the clinical significance is currently unknown. Next-generation sequencing also requires sophisticated bioinformatics systems, fast data processing, and large data storage capabilities, which can be costly. Although many institutions may have the ability to purchase next-generation sequencing equipment, many lack the computational resources and staffing to analyze and clinically interpret the data.

### Application of NGS

NGS technologies are currently used for

- Whole genome sequencing
- Metagenomics
- Investigation of genome diversity
- Epigenetics
- Discovery of non-coding RNAs
- Protein-binding sites
- Gene-expression profiling by RNA sequencing

It should be emphasized that whole-genome sequence information provides an entirely new starting point for biological research. In the future, microbiologists will not have to spend as much time cloning genes because they will be able to generate new questions and hypotheses from computer analyses of genome data. Then they can test their hypotheses in the laboratory.

## 03. Introduction to sequencing platforms

It has been more than 40 years since [the first generation of DNA sequencing](https://www.walshmedicalmedia.com/open-access/generations-of-sequencing-technologies-from-first-to-next-generation-0974-8369-1000395.pdf) technology was developed in 1977. Since then, sequencing platforms have made considerable progress and every transformation has led to a huge shift towards furthering genome research, clinical disease research and drug development.

NGS became available at the beginning of the 21st century. Perhaps the biggest advance that NGS offered was the ability to produce a huge amount of data, alongside a highly efficient, rapid, low-cost and accurate approach to DNA sequencing, beyond the reach of traditional Sanger methods.

### An Overview of Sequencing

The founding methods in DNA sequencing were the Sanger dideoxy synthesis (Sanger & Coulson, 1975; Sanger, Nicklen, & Coulson, 1977) and Maxam-Gilbert chemical cleavage (Maxam & Gilbert, 1980) methods. The Maxam-Gilbert method is based on chemical modification of DNA and subsequent cleavage of the DNA backbone at sites adjacent to the modified nucleotides. Sanger sequencing uses specific chain-terminating nucleotides (dideoxy nucleotides) that lack a 3′-OH group. Thus no phosphodiester bond can be formed by DNA polymerase, resulting in termination of the growing DNA chain at that position. The ddNTPs are radioactively or fluorescently labeled for detection in “sequencing” gels or automated sequencing machines, respectively. Although the chemistry of the original Maxam-Gilbert method has been modified to help eliminate toxic reagents, the Sanger sequencing by synthesis (SBS) dideoxy method has become the sequencing standard.

The Sanger sequencing method was developed in 1977. Although relatively slow by current NGS standards, improvements in the Sanger chain termination methodology, automation, and commercialization have enabled it to remain the most appropriate sequencing method for many current applications.

![NGS](images/sanger.png)

*Taken from: https://microbeonline.com/dna-sequencing-sanger-sequencing-method/*

### Second Generation Sequencing Methods

Second-generation NGS technologies, of the kind developed by Illumina and others, can be grouped into two major categories: sequencing by hybridization or sequencing by synthesis. Sequencing by hybridization is an approach whereby a collection of overlapping oligonucleotide sequences is assembled together to determine the DNA sequence. Sequencing by synthesis technology uses a polymerase or ligase enzyme to incorporate nucleotides with a fluorescent tag, which are then identified to determine the DNA sequence.

![NGS](images/synthesis.png)

*Taken from: https://www.mdpi.com/2075-4418/13/3/373*


All second-generation NGS technologies are dependent on amplification before sequence analysis. This amplification step is needed to generate a large enough number of copies of each DNA template so that there is sufficient signal strength for each base addition.

**Advantages of second-generation NGS**

- High sequence accuracy
- Relatively cheap
- Able to sequence fragmented DNA

**Disadvantages of second-generation NGS**

- Only capable of producing short sequencing reads (reads are between 200-300 bases long)
- Not able to resolve structural variants or distinguish highly homologous genomic regions
- Not suitable for analysis of sequences that contain large numbers of repetitive sequence elements, transcript isoforms or methylation signature

### Third Generation Sequencing Methods

Third-generation NGS is a class of DNA sequencing methods that were first described around 2009 and are still under active development. These technologies are capable of producing substantially longer reads than second-generation sequencing, with wide implications for genome research. Particularly useful applications of third-generation NGS include the study of epigenetic markers, transcriptomics and metagenomics.

These machines sequence single DNA molecules and do not amplify templates before sequencing. Instead, methodologies have been developed to detect the signal from a single DNA molecule directly, with sufficient strength, without amplification.

**Advantages of third-generation NGS**

- Possible to start with considerably longer DNA fragments
- Lack of amplification leads to easier library preparation and portable technologies
- Epigenetic markers are stable and so methylation signatures and histone modifications are preserved
- Generates very long sequence reads

**Disadvantages of third-generation NGS**

- Signals obtained from individual fragments can be weak
- Overall lower accuracy

**There are two principal companies that develop third-generation NGS technologies – Pacific Biosciences and Oxford Nanopore Technologies. Each takes a fundamentally different approach to sequencing.**

#### Pacific Biosciences sequencing chemistry

SMRT sequencing is the core technology that powers Pacific Biosciences platforms. The SMRT Cell contains millions of tiny wells called zero-mode waveguides. Single DNA molecules are immobilised at the bottom of these wells whilst DNA polymerase incorporates fluorescently labelled nucleotides. To detect the addition of each base, the light emitted in the zero-mode waveguide is recorded and analysed. This methodology allows DNA fragments to be read multiple times by attaching synthesized oligonucleotide adapters to the ends of DNA fragments, shaping them into circular templates known as ‘SMRTbells’. These individual circular molecules enable the polymerase to go around the DNA multiple times, resulting in much higher sequencing accuracy. This technology can generate very long sequence reads and much longer DNA fragments can be used.

![pacbio](images/pacbio.png)

*Taken from: https://www.pacb.com/engage/attachment/how-to-get-hifi-reads_v2/*

#### Oxford Nanopore sequencing chemistry

Oxford Nanopore Technologies developed a sequencing technology that determines the sequence of DNA molecules as they are threaded through a small nanopore. The platforms work by passing an ionic current through nanopores and measuring the changes in that current as nucleotides pass through the small pore. The nanopores can be created by proteins that puncture membranes or in solid material. An adapted phi29 motor protein is used to thread the DNA into the nanopore. As the electrical current changes across the nanopore, it is possible to determine the sequence of nucleotides being passed through it.

![nanopore](images/nanopore.jpg)

*Taken from: https://www.genome.gov/genetics-glossary/Nanopore-DNA-Sequencing*

___

# PART 2: NGS FORMATS

>This is a general module to help you get familiarised with data formats in a practical way. Data QC and making a consensus sequence will be explored in more detail in later modules.
 

# Commonly used file formats for next-generation sequencing (NGS) data

In this session, we are going to get familiar with several common file formats used for sequence data. Then we are going to perform some quality control (QC) on some FASTQ-formatted sequence data.



### FASTA

Among the most common and simplest file formats for representing nucleotide sequences is FASTA. Essentially, each sequence is represented by a 'header' line that begins with a '>', followed by lines containing the actual nucleotide sequence. By convention, the first 'word' in the header line is a unique identifier, which is usually an accession number. Consider this example of a FASTA-formatted nucleotide sequence:

    >LC719646.1 Influenza A virus (A/swine/Tottori/B34/2020(H1N1)) segment 8 NS1, NEP genes for nonstructural protein 1, nuclear export protein, complete cds
    ATGGAATCCAACACCATGTCAAGCTTTCAGGTAGACTGTTTTCTTTGGCATATTCGCAAGCGATTTGCAG
    ACAATGGATTGGGTGATGCCCCATTCCTTGATCGGCTACGCCGAGATCAAAAGTCCTTAAAAGGAAGAGG
    CAACACCCTTGGCCTCGACATCAAAACAGCCACTCTTGTTGGGAAACAAATTGTGGAATGGATTTTGAAA
    GAGGAATCCAGCGAGACACTTAGAATGGCAATTGCATCTGTACCTACTTCGCGTTACATTTCTGACATGG
    CCCTCGAGGAAATGTCACGAGACTGGTTCATGCTTATGCCTAGGCAAAAGATAATAGGCCCTCTTTGCGT
    GCGATTGGACCAGGCGGTCATGGATAAGAACGTAGTACTGGAAGCAAACTTCAGTGTAATCTTCAACCGA
    TTAGAGACCTTGATACTACTAAGGGCTTTCACTGAGGAGGGAACAATAGTTGGAGAAATTTCACCATTAC
    CTTCTCTTCCAGGACATACTTATGAGGATGTCAAAAATGCAGTTGGGGTYCTCATCGGAGGACTTGAGTG
    GAATGGTAACACGGTTCGAGTCTCTGAAAATATACAGAGATTCGCTTGGAGAAGCTGTGATGAGAATGGG
    AGACCTTCACTACCTCCAGAGCAGAAATGAGAAGTGGCGGGAACAATTGGGACAGAAATTTGAGGAAATA
    AGGTGGTTAATTGAAGAAATACGACACAGATTGAAAGCGACAGAGAATAGTTTCGAACAAATAACATTTA
    TGCAAGCCTTACAACTACTGCTTGAAGTAGAGCAAGAGATAAGAGCTTTCTCGTTTCAGCTTATTTAA

- The first line begins with '>' indicating that it is the header line.
- This is immediately followed by 'LC719646.1', which is an accession number for [this sequence in the GenBank database](https://www.ncbi.nlm.nih.gov/nuccore/LC719646.1).
- Then follows the actual nucleotide sequence, split over several lines, beginning with 'ATGGAATCCAACA...' and ending with '...TTATTTAA'.

It is very common to combine multiple sequences into a single multi-FASTA file like this:

    >ON084923.1 Influenza A virus (A/ostriches (Struthio camelus)/Egypt/Mansoura1/2022(H5N8)) segment 4 hemagglutinin, HA2 region, (HA) gene, partial cds
    GTACCACCATAGCAATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGAT
    GGAGTTACCAATAAGGTCAACTCAATCATTGACAAAATGAACACTCAATTTGAGGCAGTTGGAAGGGAGT
    TTAATAACTTAGAAAGGAGGATAGAGAATTTGA
    
    >MW170960.1 Influenza A virus (A/swine/Italy/410927/2018(H1N2)) segment 6 neuraminidase (NA) gene, partial cds
    CCTTATGCAGATTGCTATCCTGGTAACTACTGTTACATTTCACTTCAAGCAATATGAATACAATTTCTAC
    CCAAACAACCAAGTAATGCCATGTGAACCAACGATAATTGAAAGAAACATAACAGAAATAGTGTACCTGG
    CCAACACCAC
    
    >MW170083.1 Influenza A virus (A/swine/Italy/134212/2019(H1N2)) segment 6 neuraminidase (NA) gene, partial cds
    GTAGTAACTGCCTGAGTCCTAATAATGAAGAAGGGGGTCATGGGGTAAAAGGCTGGGCCTTTGATGATGG
    AAATGATGTTTGGATGGGAAGAACGATCAGCGAAAAGTTACGATTAGGTTATGAAACCTTCAAGGTCATC
    GACGGTTGGTCCAAGCC
    
    >MW169741.1 Influenza A virus (A/swine/Italy/8745/2019(H3N2)) segment 2 polymerase PB1 (PB1) gene, partial cds
    TCGTTCCATCCTCAATACTAGCCAAAGGGGAATTCTTGAGGATGAGCAAATGTATCAGAAGTGCTGCAAT
    TTATTTGAGAAATTCTTCCCTAGCAGTTCATACAGGAGGCCAGTGGGAATTTCAAGCATGGTGGAGGCCA
    TGGTATCTAGGGCCAGAATTGATGCACGGATTGATTTCGAGTCTGGAAGGATTAATAAAGAAGAATTTGC
    TGAGATCATGAAGATCTGTTCCACCATAGAAGAGTTCAGACGGCAAAAGTAG
    
    >OM149369.1 Influenza A virus (A/Hilly chicken/Bangladesh/Avian Influenza Virus/2019(H9)) segment 4 hemagglutinin (HA) gene, partial cds
    AATTTCTTAGCTAGCAAAATGGAAACAATAACACTGATGACTACACTACTATTAACAACAACGAGCCTTG
    CAGACAAAATCTGTATCGGCCACCAATCGACAAATTCTACAGAAACTGTAGACACACTAACAGAAACTAA
    CGTTCCTGTGACACATGCCAAAGAGTTGCTCCATACGGATCACAATGGAATGCTGTGTGCAACAAATCTA
    GGACATCCCCTCATCCTAGATAAATGTAACGTAGAAGGACTGATCTACGGCAACCCTTCTTGTGATCT


If you want a more detailed history of the FASTA file format, then you could take a look at the Wikipedia page here: https://en.wikipedia.org/wiki/FASTA_format.
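
Because the format is so simple, a multi-FASTA file can be read with a few lines of Python. The following is a minimal sketch rather than a replacement for an established parser: the function name `parse_fasta` is our own, the two records are shortened versions of entries shown above, and the code assumes every header has at least one word after the '>'.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its full sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # The identifier is the first 'word' of the header line.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are joined together until the next header.
            records[current_id] += line
    return records


# Shortened versions of two records from the multi-FASTA example above.
example = """\
>MW170960.1 Influenza A virus (A/swine/Italy/410927/2018(H1N2)) segment 6
CCTTATGCAGATTGCTATCC
TGGTAACTAC

>MW170083.1 Influenza A virus (A/swine/Italy/134212/2019(H1N2)) segment 6
GTAGTAACTGCCTGAGTCCT
"""

for seq_id, seq in parse_fasta(example.splitlines()).items():
    print(seq_id, len(seq))
```

To read a real file, an open file handle can be passed in place of `example.splitlines()`.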



### FASTQ
The widely used FASTA file format has the great advantage of simplicity. However, this simplicity can be restrictive if we want to include additional data/metadata in addition to the sequence.
Given the non-negligible error rates of NGS technologies, often we need to accompany our sequence data with quality scores that estimate our confidence in the accuracy of the sequence data. As we will see later, this allows us to perform quality control checks and filter-out poor-quality data before performing analyses.
FASTQ is a simple text-based format that allows us to include quality scores. A single sequence is represented by four lines of text:

    @ERR8261968.1 1 length=97
    ACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTA
    +ERR8261968.1 1 length=97
    CCCCCFDDFFFFGGGGGGGGGGHHHHHHHHHHHGGGGHHHHHHHHHHHHHHHGHHGHHIIHHGGGGGGHHHHHHHHHHHHHHHHHHHGGGHHHHHHH

- The first line is a 'header' containing a unique identifier for the sequence and, optionally, further description.
- The second line contains the actual nucleotide sequence.
- The third line is redundant and can be safely ignored. It always begins with a '+' character; sometimes it then repeats the identifier from the first line, and sometimes it contains nothing else.
- The fourth line contains a string of characters that encode quality scores for each nucleotide in the sequence. Each single character encodes a score, typically a number between 0 and 40, as we saw during the introductory lecture.

| Character | ASCII | FASTQ quality score (ASCII – 33) |
| -- | -- | -- |
| ! | 33 | 0 |
| " | 34 | 1 |
| # | 35 | 2 |
| $ | 36 | 3 |
| % | 37 | 4 |
| ... | ... | ... |
| C | 67 | 34 |
| D | 68 | 35 |
| E | 69 | 36 |
| F | 70 | 37 |
| G | 71 | 38 |
| H | 72 | 39 |
| I | 73 | 40 |

So, in the example above, we can see that most of the positions within the 97-nucleotide sequence have scores in the high 30s, which indicates a high degree of confidence in their accuracy.

- A score of 30 denotes a 1 in 1000 chance of an error, i.e. 99.9% accuracy.
- A score of 40 denotes a 1 in 10,000 chance of an error, i.e. 99.99% accuracy.
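
The decoding described above can be checked with a short Python sketch. It assumes the "+33" encoding shown in the table, and uses the standard Phred relationship Q = -10 × log10(P), where P is the probability that the base call is wrong. The function names are our own.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string into a list of Phred scores."""
    # Each character's ASCII code minus the offset gives the score.
    return [ord(char) - offset for char in quality_string]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# First ten quality characters of the example read shown above.
scores = decode_qualities("CCCCCFDDFF")
print(scores)  # [34, 34, 34, 34, 34, 37, 35, 35, 37, 37]

for q in (30, 40):
    # Formatted to five decimal places to make the two values easy to compare.
    print(q, f"{error_probability(q):.5f}")
```

A score of 30 therefore corresponds to an error probability of about 0.001, and a score of 40 to about 0.0001, matching the two bullet points above.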

You can read more about the FASTQ file format and quality scores here:
Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. *Nucleic Acids Research*, **38**, 1767–1771. https://doi.org/10.1093/nar/gkp1137.



### SAM and BAM

A SAM file (usually named *.sam) is used to represent aligned sequences. It is particularly useful for storing the results of aligning genomic or transcriptomic sequence reads against a reference genome sequence. The BAM file format is a compressed form of SAM. This has the disadvantage that it is not readable by a human but has the advantage of being smaller than the corresponding SAM file and thus easier to share and copy between locations.

The header section comes before the alignment section, and every header line starts with “@” followed by a two-letter record type code (for example @HD or @SQ). Each line is tab-delimited, and every field after the record type follows the “TAG:VALUE” format, where TAG is itself a two-letter code. The most common record types and their tags are:

    @HD - The header line (first line)
        SO - Sorting order of alignments (unknown (default), unsorted, queryname and coordinate)
    @SQ - Reference sequence dictionary
        SN - Reference sequence name
        LN - Reference sequence length
    @PG - Program
        ID - The program ID
        PN - The program name
        VN - Program version number
        CL - The command line used to create the SAM file
    @RG - Read group - “a set of reads that were all the product of a single sequencing run on one lane”
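
As a rough illustration of the tab-delimited TAG:VALUE layout, the sketch below splits header lines into a record type and a dictionary of tags. The header shown is invented for the example (the reference name, length, and program details are not from a real file), and the function name is our own. Free-text comment lines (@CO) do not follow the TAG:VALUE layout and are not handled here.

```python
def parse_header_line(line: str) -> tuple:
    """Split one SAM header line into (record_type, {tag: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    # Split each field on the first colon only, because values such as
    # command lines may themselves contain colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


# An invented header, for illustration only.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref1\tLN:1000",
    "@PG\tID:aligner\tPN:aligner\tVN:0.0\tCL:aligner ref.fa reads.fq",
]

for line in header:
    print(parse_header_line(line))
```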

In the alignment section, each line has 11 mandatory tab-delimited fields, optionally followed by further fields carrying extra information. These are:

    QNAME: Read name
    FLAG: Bitwise flag recording whether the read is mapped, part of a pair, which strand it maps to, etc.
    RNAME: Name of the reference sequence that the read aligns to
    POS: Leftmost (1-based) position at which this alignment maps to the reference
    MAPQ: Mapping quality of the read to the reference (Phred-scaled probability that the mapping is wrong)
    CIGAR: Compact Idiosyncratic Gapped Alignment Report
    RNEXT: Reference sequence name of the paired mate
    PNEXT: Position of the paired mate
    TLEN: Template length/insert size (difference in outer coordinates of paired reads)
    SEQ: The actual read DNA sequence
    QUAL: ASCII-encoded Phred quality scores (+33)
    TAGS: Optional data (not one of the 11 mandatory fields) - many options, e.g. the MD:Z string for mismatches
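
To make the column layout concrete, here is a small Python sketch that splits one alignment line into named fields, decodes a few FLAG bits, and breaks a CIGAR string into its operations. The alignment record is invented for illustration, the function names are our own, and only a subset of the FLAG bits defined in the SAM specification is listed. For real analyses, a dedicated library or samtools is the safer choice.

```python
import re

# Names of the 11 mandatory columns, in order.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A subset of the FLAG bits defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
}


def parse_alignment(line: str) -> dict:
    """Split one alignment line into mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, columns[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields, if any
    return record


def describe_flag(flag: int) -> list:
    """List the properties (from FLAG_BITS) whose bit is set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list:
    """Turn a CIGAR string such as '3S7M' into [(3, 'S'), (7, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# An invented record, for illustration only.
line = "read1\t83\tref1\t100\t60\t3S7M\t=\t50\t-57\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_alignment(line)
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]))
print(describe_flag(record["FLAG"]))
```

In this made-up record the FLAG value 83 has the bits for "paired", "reverse strand" and "first in pair" set, and the CIGAR string says that 3 bases are soft-clipped and 7 bases are aligned.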


You can read more about SAM and BAM formats here:
 - Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. *Bioinformatics*, **25**, 2078–2079. https://doi.org/10.1093/bioinformatics/btp352 and
-  [https://samtools.github.io/hts-specs/SAMv1.pdf](https://samtools.github.io/hts-specs/SAMv1.pdf).

We can view BAM files graphically using specialised genome browser software such as:
- [IGV](https://igv.org/)
- [Tablet](https://ics.hutton.ac.uk/tablet/)
- [Artemis / BAMview](http://sanger-pathogens.github.io/Artemis/BamView/) 



### Binary Alignment Map (BAM) format

Binary Alignment Map (BAM) is a compressed SAM file. It is compressed using the BGZF compression method. 

### CRAM files 

CRAM files are also compressed SAM files, designed by the EBI to reduce storage space. CRAM compression is based on the reference sequence that the data is aligned to.

Data is compressed using one of the general purpose compressors (gzip, bzip2). CRAM records are compressed using several different encoding strategies. For example, bases are reference compressed by encoding base differences rather than storing the bases themselves. External reference sequences introduce the only external dependency into the CRAM format. When external reference sequences cannot be conveniently used, the reference sequences can also be embedded within the CRAM files. However, when embedded reference sequences are used, only those reference sequence regions that have reads aligned against them are preserved in the CRAM file.
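
The idea of reference-based compression can be illustrated with a toy example. This is not the CRAM encoding itself, which is considerably more elaborate; it is a sketch, with function names of our own, that ignores insertions, deletions and clipping, and only shows why storing differences is cheaper than storing every base.

```python
def encode_against_reference(read: str, reference: str, pos: int) -> dict:
    """Store a read as its position, length, and mismatches to the reference.

    `pos` is the 0-based start of the read on the reference.
    """
    window = reference[pos:pos + len(read)]
    diffs = [(i, base) for i, (base, ref_base) in enumerate(zip(read, window))
             if base != ref_base]
    return {"pos": pos, "length": len(read), "diffs": diffs}


def decode_against_reference(encoded: dict, reference: str) -> str:
    """Rebuild the read from the reference plus the stored differences."""
    start = encoded["pos"]
    bases = list(reference[start:start + encoded["length"]])
    for offset, base in encoded["diffs"]:
        bases[offset] = base
    return "".join(bases)


reference = "ATGGAATCCAACACCATGTCAAGC"  # short reference fragment
read = "TCCAAGACC"                       # differs from the reference at one base

encoded = encode_against_reference(read, reference, pos=6)
print(encoded)
print(decode_against_reference(encoded, reference) == read)
```

A read that matches the reference perfectly is reduced to a position and a length, which is why the decoder needs access to the same reference sequence that was used for encoding.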

### GFF3 format

GFF3 stands for General Feature Format version 3. This is a tab-delimited file describing the features (such as genes or coding sequences) that can be associated with a DNA or protein sequence. An example can be seen in the figure below.

The file contains 9 fields:

    Sequence ID
    Source : algorithm used to derive the feature, such as Prodigal, Prokka, GENSCAN etc.
    Feature type : details what the feature is (CDS, mRNA)
    Feature start
    Feature stop
    Score : a confidence value from the algorithm used (for example an E-value)
    Strand
    Phase : for CDS features, the number of bases (0, 1 or 2) to skip from the start of the feature to reach the first base of the next complete codon
    Attributes : provides additional information about each feature
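
The nine columns map naturally onto a dictionary. The sketch below parses a single feature line; the line itself is invented for illustration, and the column names in `GFF3_COLUMNS` and the function name are our own labels. It assumes the attributes column holds semicolon-separated `key=value` pairs with URL-style escaping, and that a "." marks a missing value.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line: str) -> dict:
    """Split one GFF3 feature line into its nine named columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(values)}")
    feature = dict(zip(GFF3_COLUMNS, values))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    feature["attributes"] = {
        key: unquote(value)
        for key, value in (pair.split("=", 1)
                           for pair in feature["attributes"].split(";") if pair)
    }
    return feature


# An invented feature line; identifiers and coordinates are illustrative.
line = "contig_1\tProdigal\tCDS\t27\t719\t.\t+\t0\tID=cds_1;product=hypothetical protein"
feature = parse_gff3_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```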


## SeqKit

[SeqKit](https://github.com/shenwei356/seqkit?tab=readme-ov-file) is a tool that has gained popularity in recent years and allows manipulation of FASTA/FASTQ and BAM files. In addition to this, we can view general statistics, make format changes, and edit files.

Here you can find the uses and examples of the tool: https://bioinf.shenwei.me/seqkit/usage/

In [None]:
#Install SeqKit (-y answers the confirmation prompt, which a notebook cell cannot do interactively)
!conda install -y -c bioconda seqkit

In [None]:
#To view the functionalities of SeqKit, execute the following command
!seqkit -h

*Later on, we will review some examples*

## Public repositories of NGS data 
The Sequence Read Archive (SRA) contains a huge number of sequence reads generated by various NGS methods. We can browse this data on the web via the NCBI's web portal. We can also download NGS datasets in FASTQ format and analyse them locally, for example in our virtual machine. Let's take a look at an example dataset: [SRR19504912](https://www.ncbi.nlm.nih.gov/sra/?term=SRR19504912). Which virus does this sequencing dataset come from?

Let's use the web interface to take a look at a few of the sequence reads in this dataset. Click on where it says [SRR19504912](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR19504912) under 'Run'. Then click on the 'Reads' tab. This will take you to [this page](https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&page_size=10&acc=SRR19504912&display=reads), which looks like this:

![enter image description here](https://github.com/WCSCourses/ViralBioinfAsia2022/raw/main/course_data/NGS_file_formats_and_data_QC/images/Screenshot%202022-07-31%20at%2016.05.10.png)

In the figure above, we can see a single sequence read along with the quality scores for each nucleotide position in its sequence. Notice that the scores are high (well above 30) for most of this sequence read.

Now let's download the sequence data (i.e. the whole set of reads) from this sequencing run from the SRA. Unfortunately it is not easy to download the data directly from the NCBI website; instead we have to use the *fasterq-dump* tool from the [NCBI's SRA Toolkit](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). So, first execute this command in the Terminal:

In [None]:
#Download the SRA tools using bioconda (-y answers the confirmation prompt)
!conda install -y -c bioconda sra-tools

In [None]:
#Download the file SRR19504912 from SRA with fasterq-dump
!fasterq-dump --split-files SRR19504912

You should then see some output something like this:

    spots read      : 306,691
    reads read      : 613,382
    reads written   : 613,382

You will notice that new files have been created called *SRR19504912_1.fastq* and *SRR19504912_2.fastq*. There are two files because this dataset consists of paired sequence reads.

In [None]:
#Review the statistics of the downloaded files using SeqKit
!seqkit stats *.fastq -T

In [None]:
#You can use the following command to view the contents of FASTQ files
!seqkit head -n 5 SRR19504912_1.fastq

In [None]:
#You can use the following command to convert FASTQ files to FASTA
!seqkit fq2fa SRR19504912_1.fastq -o SRR19504912_1.fasta

In [None]:
#Print the ID and sequence length of each record in the FASTA file
!seqkit fx2tab -n -i -l SRR19504912_1.fasta

In [None]:
#Split the FASTA file into multiple files with 500 sequences each
!seqkit split -s 500 SRR19504912_1.fasta
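
To see what a summary such as the one from `seqkit stats` is built on, the sketch below reads four-line FASTQ records and computes the number of reads and their minimum, average and maximum lengths. It is an illustration only and is much slower than SeqKit: the function names are our own, the two reads are invented, and the code assumes well-formed records in which the sequence and the quality string each occupy exactly one line.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    iterator = iter(lines)
    for header in iterator:
        sequence = next(iterator).strip()
        next(iterator)  # the '+' separator line carries no information
        quality = next(iterator).strip()
        yield header.strip()[1:].split()[0], sequence, quality


def length_stats(lines: Iterable[str]) -> dict:
    """Summarise the number of reads and their lengths."""
    lengths = [len(seq) for _, seq, _ in read_fastq(lines)]
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Two invented reads, for illustration only.
example = """\
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIIIII
"""

print(length_stats(example.splitlines()))
```

To summarise a downloaded file instead, an open file handle can be passed in, for example `with open("SRR19504912_1.fastq") as handle: print(length_stats(handle))`.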

## References

https://www.technologynetworks.com/genomics/articles/an-overview-of-next-generation-sequencing-346532

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6020069/

https://www.nature.com/articles/nbt1486

https://omicstutorials.com/next-generation-sequencing-ngs-introduction/

https://www.healio.com/hematology-oncology/learn-genomics/whole-genome-sequencing/strengths-and-limitations-of-next-generation-sequencing

https://www.mlsu.ac.in/econtents/1111_Microbial%20genomes.pdf

http://www.cs.cornell.edu/projects/btr/bioinformaticsschool/slides/stanhope.pdf

https://www.genome.gov/about-genomics/fact-sheets/A-Brief-Guide-to-Genomics

https://www.britannica.com/science/genomics

https://www.biorender.com/template/next-generation-sequencing-workflow

https://microbeonline.com/dna-sequencing-sanger-sequencing-method/

https://www.mdpi.com/2075-4418/13/3/373

https://www.genome.gov/genetics-glossary/Nanopore-DNA-Sequencing

https://frontlinegenomics.com/dna-sequencing-how-to-choose-the-right-technology/


*Adapted from:*
 
- Advanced Bioinformatics Course developed for the GPS and JUNO projects - Wellcome Sanger Institute
- SARS-CoV-2 Bioinformatics for Beginners Course - Wellcome Connecting Science
- Viral Genomics and Bioinformatics Asia 2022 - Wellcome Connecting Science


*Modified by Luisa Sacristán (Universidad de los Andes-CABANA)*
