# Data formats for NGS

Here we will take a closer look at some of the most common data formats used in NGS analysis.

First, check that you are in the right directory:

In [None]:
pwd

It should display something like

`/home/manager/course_data/data_formats/data`

## FASTA
The FASTA format is one of the most common and simplest file formats for representing nucleotide sequence data. Each sequence in a FASTA file is composed of two parts, a header line and the actual sequence. The header always starts with the symbol ">" and is followed by information about the sequence, such as a sequence name/unique identifier. 

Let's look at an example:

```
>Sequence_1
CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG
AAATCATGACGACTTGAAGTGAAAAAGTGAAAAATGAGAAATGAACGTGACGAC
AAAATGACGAAATCACTAAAAAACGTGACGACTTGAAAAATGACCAC
```
We can see that each sequence has two parts:

* The first line begins with '>', indicating that it is the 'header' line. This is immediately followed by 'Sequence_1', which is the unique identifier for this sequence.
* The remaining lines hold the actual nucleotide sequence, which may be split over several lines. Here it begins with 'CTTGACGACTTGAA...' and ends with '...TGACCAC'.

It is also possible to have multiple sequences in one multi-FASTA file like this:

```
>Sequence_1
AAATCATGACGACTTGAAGTGAAAAAGTGAAAAATGAGAAATGAACGTGACGAC
CGAATGACGAAATCACTAAAAAACGTGACGACTTGAAAAATGACCAC
>Sequence_2
CTTGAGACGAAATCACTAAAAAACGTGACGACTTGAAGTGAAAAATGAGAAATG
AAAATGACGAAATCATGACGACTTGAAGTGAAAAATAAATGACC
```

### Exercises
__Q1: How many sequences are there in the fasta file example.fasta? (hint: is there a grep command you can use?)__

If you get stuck here, do not spend too much time trying to solve this and move on. A solution will be provided during the session.

## FASTQ
Often we need to accompany our sequence data with quality scores that estimate our confidence in the accuracy of the sequence data. The FASTQ format is an extension of the FASTA file format, and includes a quality score for each nucleotide in the sequence.  

Let's look at an example:
  
```
@ERR007731.739 IL16_2979:6:1:9:1684/1
CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG
+
BBCBCBBBBBBBABBABBBBBBBABBBBBBBBBBBBBBABAAAABBBBB=@>B
```

We can see that for each sequence we get four lines of text:

* The first line is a 'header' containing a unique identifier for the sequence and, optionally, further information.
* The second line contains the nucleotide sequence.
* The third line starts with `+` and optionally contains the ID again. This line is redundant and can be safely ignored.
* The fourth line contains a string of characters that encode quality scores for each nucleotide in the sequence.

### FASTQ and Illumina data
Paired-end Illumina sequencing involves reading from both ends of a sequence fragment, as illustrated below.

![Paired-end Reads](img/paired_end_reads.png)

The above image has been taken from
[https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/read-length.html#](https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/read-length.html#)

During this process two FASTQ files are produced. One is usually named with `_1.fastq` at the end and contains the reads sequenced from one end of each fragment. The other is usually named with `_2.fastq` at the end and contains the reads sequenced from the other end. Each fragment has a unique name: the read from one end is labelled with the fragment name followed by `/1`, and the corresponding read from the other end is labelled with the fragment name followed by `/2`. The order of the reads in the two files also matters, as many tools assume that the first read in the `_1.fastq` file and the first read in the `_2.fastq` file come from the same fragment, the second reads likewise, and so on.

So FASTQ data for an Illumina paired-end sequencing run might look like:

In one fastq file (e.g. run_1.fastq):

```
@ERR007731.739 IL16_2979:6:1:9:1684/1
CTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATG
+
BBCBCBBBBBBBABBABBBBBBBABBBBBBBBBBBBBBABAAAABBBBB=@>B
@ERR007731.740 IL16_2979:6:1:9:1419/1
TAGGGGAAAGTTCTCATGAGTACATCCGAAAAGAGGGCAACCCCAAATCAAAAG
+
BBABBBABABAABABABBABBBAAA>@B@BBAA@4AAA>.>BAA@779:AAA@A
```

The other fastq file (e.g. run_2.fastq):

```
@ERR007731.739 IL16_2979:6:1:9:1684/2
GATGACGACCCGAAAAATTACGAAATCACTAGCCAACGTGAATTTTGAGAACTA
+
BBABCBCBBBBBABBABBBBBBBABBBBBBBBBBBBBBAAAAAABBBBBA@>B
@ERR007731.740 IL16_2979:6:1:9:1419/2
AAAAAAAAAGATGTCATCAGCACATCAGAAAAGAAGGCAACTTTAAAACTTTTC
+
BBCBBBABABBABABABBABBBAAA>@BBBBAA@4AAA>.>BAA@779:AA>@A
```
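
Because many tools rely on this ordering, it can be worth checking that two FASTQ files are still in sync. The following is a minimal sketch, assuming four-line FASTQ records and read names like those in the examples above. The tiny input files and the helper `fragment_names` are made up for illustration.

```shell
# Print the fragment name of every read: the first word of each header line,
# without the leading "@" and without a trailing "/1" or "/2".
# Assumes each record is exactly four lines long.
fragment_names() {
  awk 'NR % 4 == 1 {
    name = $1
    sub(/^@/, "", name)
    sub(/\/[12]$/, "", name)
    print name
  }' "$1"
}

# Two tiny made-up files standing in for run_1.fastq and run_2.fastq.
tmp=$(mktemp -d)
printf '%s\n' '@frag1/1' 'ACGT' '+' 'IIII' '@frag2/1' 'TTGA' '+' 'IIII' > "$tmp/run_1.fastq"
printf '%s\n' '@frag1/2' 'CCGA' '+' 'IIII' '@frag2/2' 'GATC' '+' 'IIII' > "$tmp/run_2.fastq"

fragment_names "$tmp/run_1.fastq" > "$tmp/names_1.txt"
fragment_names "$tmp/run_2.fastq" > "$tmp/names_2.txt"

# cmp -s is silent and succeeds only if both name lists are identical.
if cmp -s "$tmp/names_1.txt" "$tmp/names_2.txt"; then
  pairing="in sync"
else
  pairing="out of sync"
fi
echo "Read pairs are $pairing"
rm -r "$tmp"
```

If a read were missing from one file, or the two files were in a different order, the name lists would differ and the check would report the files as out of sync.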

### Quality scores
In a FASTQ file, a single character encodes a quality score, typically a number between 0 and 40 (but in theory this can range between 0-93). Each character maps to an ASCII value which in turn can be converted to a quality score.

| Character | ASCII | FASTQ quality score (ASCII – 33) 
| --|--|--
| ! | 33 | 0
| " | 34 | 1
| # | 35 | 2
| $ | 36 | 3
| % | 37 | 4
| ... | ... | ...
| C | 67 | 34
| D | 68 | 35
| E | 69 | 36
| F | 70 | 37
| G | 71 | 38
| H | 72 | 39
| I | 73 | 40

ASCII codes 0 to 31 are reserved for non-printable control characters (e.g. tab, return) and code 32 is the space character. None of these can be used in the quality string, so the first usable character is `!` (ASCII 33) and we subtract 33 from the ASCII value of a character to determine its quality score. For example, the ASCII code for "C" is 67, so the corresponding quality is:
   
```   
Q = 67 - 33 = 34
```
   
So, in the FASTQ examples above, most of the base calls have scores in the 30s, which indicates a high degree of confidence in their accuracy. A score of 30 denotes a 1 in 1000 chance of an error, i.e. 99.9% accuracy.

|Quality Score|Probability of incorrect base call|Base call accuracy|
|---|---|---|
|10|1 in 10|90%|
|20|1 in 100|99%|
|30|1 in 1000|99.9%|
|40|1 in 10,000|99.99%|
|50|1 in 100,000|99.999%|
|60|1 in 1,000,000|99.9999%|
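
The table follows the Phred scale, in which a quality score Q corresponds to an error probability of 10^(-Q/10). As a small sketch of that conversion (the function name `phred_to_error` is made up for this example), awk can do the arithmetic:

```shell
# Convert a Phred quality score Q into the probability that the base call
# is wrong: P = 10^(-Q/10)
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print the error probability for a few of the scores in the table.
for q in 10 20 30 40; do
  printf 'Q%s\t%s\n' "$q" "$(phred_to_error "$q")"
done
```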


You don't need to worry about being able to convert the characters to a quality score as most of the software tools can interpret them automatically. But the following perl command will print the quality score for an ASCII character. Try changing the "A" to another character, for example one from the quality strings above (e.g. @, = or B).

In [None]:
perl -e 'printf "%d\n",ord("A")-33;'
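
To decode a whole quality string rather than a single character, one option is to dump its bytes as decimal numbers and subtract 33 from each. This is a rough sketch using standard tools. The function name `decode_quals` is made up here, and it assumes the ASCII - 33 encoding described above.

```shell
# Print the quality scores encoded in a quality string, separated by spaces.
# Assumes the ASCII - 33 encoding described in this section.
decode_quals() {
  printf '%s' "$1" |
    od -An -tu1 |   # dump each byte as an unsigned decimal number
    awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
         END { print "" }'
}

# The last few characters of a quality string from the examples above.
decode_quals 'BBB=@>B'
```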

### Exercises
__Q2: How many reads are there in the file example.fastq? (Hint: remember that `@` is a possible quality score. Is there something else in the header that is unique?)__

Again, don't worry if you cannot solve this, a solution will be provided during the practical session.

__Note__: The FASTQ format is a text based file, however it is possible (and good practice) to compress these files with `gzip`. A gzipped fastq file is usually suffixed with `.fastq.gz` or `.fq.gz`.

## SAM/BAM
A common task with sequence data is to match or align it to a reference genome. [SAM (Sequence Alignment/Map)](https://samtools.github.io/hts-specs/SAMv1.pdf) is a standard format for storing sequence read alignments to a reference genome. If no reference genome is available, the data can be stored unaligned. SAM is a text based file. BAM is the compressed binary version of SAM. Compressed binary files are not readable by a human but are smaller than the corresponding uncompressed file meaning they take up less disk space and make it easier and quicker to copy files between locations.

SAM/BAM files consist of a header section (optional) and an alignment section. The alignment section contains one record (a fragment alignment) per line describing the alignment between fragment and reference. Each record has 11 fixed columns and optional key:type:value tuples. Open the [SAM/BAM file specification document](https://samtools.github.io/hts-specs/SAMv1.pdf) as you may need to refer to it throughout this tutorial. 

Now let us have a closer look at the different parts of the SAM/BAM files. 

### Header Section
Each line or record in the SAM header starts with an `@`, followed by a two-letter code defining the record type (the different types are defined in the [SAM/BAM format specification document](https://samtools.github.io/hts-specs/SAMv1.pdf)). Each line or record contains meta-data for that specific record which is captured as a series of key-value pairs in the format of ‘TAG:VALUE’.

#### Read groups
One useful record type is RG which can be used to describe each unit of sequencing, e.g. a barcode or lane of sequencing data for Illumina. The RG code can be used to capture extra meta-data for the unit of sequencing. Some common RG TAGs are:

* ID: Read group identifier
* PL: Sequencing platform
* LB: Library name
* PI: Predicted insert/fragment size
* DS: Description
* SM: Sample identifier
* CN: Sequencing centre

### Exercises
Look at the following line from the header of the SAM/BAM file and answer the questions that follow: 

```
@RG ID:ERR003612 PL:ILLUMINA LB:g1k-sc-NA20538-TOS-1 PI:2000 DS:SRP000540 SM:NA20538 CN:SC
```

You may want to refer to section 1.3 of the SAM specification.
   
__Q3: What does RG stand for?__

__Q4: What platform was used to produce the data?__

__Q5: Where was the sequence data produced?__

__Q6: What is the expected insert/fragment size?__  

### Alignment Section
The alignment section of a SAM file contains one line per alignment. Each line consists of 12 fields/columns described below. The first 11 columns are mandatory.

1. QNAME: Query NAME of the read or the read pair
2. FLAG: Bitwise FLAG (pairing, strand, mate strand, etc.)
3. RNAME: Reference sequence NAME
4. POS: 1-Based leftmost POSition of clipped alignment
5. MAPQ: MAPping Quality (Phred-scaled)
6. CIGAR: Extended CIGAR string (operations: MIDNSHPX=)
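7. MRNM: Mate Reference NaMe (’=’ if same as RNAME); called RNEXT in the current specification
8. MPOS: 1-Based leftmost Mate POSition; called PNEXT in the current specification
9. ISIZE: Inferred Insert SIZE; called TLEN in the current specification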
10. SEQ: Query SEQuence on the same strand as the reference
11. QUAL: Query QUALity (ASCII-33=Phred base quality)
12. OTHER: Optional fields
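
Because the columns are tab separated, standard tools such as awk can pick them apart. The sketch below labels the 11 mandatory fields of a single record. The record itself is invented for illustration (it is not taken from the course data), and `label_sam_fields` is a made-up helper name.

```shell
# A made-up alignment record with the 11 mandatory, tab-separated fields.
sam_record=$(printf 'read001\t99\tchr1\t100\t60\t8M\t=\t250\t158\tACGTACGT\tIIIIIIII')

# Print each mandatory field next to its column name, one per line.
label_sam_fields() {
  printf '%s\n' "$1" | awk -F'\t' '
    BEGIN { split("QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL", names, " ") }
    { for (i = 1; i <= 11; i++) printf "%s\t%s\n", names[i], $i }'
}

label_sam_fields "$sam_record"
```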

The image below provides a visual guide to some of the fields/columns of the SAM format.

![SAM format](img/SAM_BAM.png)

In a SAM/BAM file, the alignment in this image would be represented as:

`fragment001	163	Chr1	19999970	23	40M5D30M2I28M	=	20000147	213	GGTGCGTGGAT...   
<=@A@??@=A...`   


### Exercises
Let's have a look at example.sam. Notice that we can use the standard Linux operations like __less__ on this file.

In [None]:
less -S example.sam

__Q7: What is the mapping quality of ERR003762.5016205? (Hint: can you use grep and awk to find this?)__

__Q8: What is the CIGAR string for ERR003814.6979522? (We will go through the meaning of CIGAR strings in the next section)__

__Q9: What is the inferred insert/fragment size for ERR003814.1408899?__

### CIGAR string
Column 6 of the alignment is the CIGAR string for that alignment. The CIGAR string provides a compact representation of sequence alignment. Have a look at the table below. It contains the meaning of all different symbols of a CIGAR string:

|Symbol    |Meaning                                             |
|---       |---                                                 |
|M         |alignment match or mismatch                         |
|=         |sequence match                                      |
|X         |sequence mismatch                                   |
|I         |insertion into the reference                        |
|D         |deletion from the reference                         |
|S         |soft clipping (clipped sequences present in SEQ)    |
|H         |hard clipping (clipped sequences NOT present in SEQ)|
|N         |skipped region from the reference                   |
|P         |padding (silent deletion from padded reference)     |

Below are two examples describing the CIGAR string in more detail.
  
__Example 1:__  
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGTACGTACGTACGT  
Read:&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGA  
Cigar: 4M 4D 8M  

The first four bases in the read are the same as in the reference, so we can represent these as 4M in the CIGAR string. Next is a deletion of 4 bases, represented by 4D, followed by 7 alignment matches and one alignment mismatch, represented by 8M. Note that the mismatch at position 16 is included in 8M. This is because it still aligns to the reference.

__Example 2:__  
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACTCAGTG-&nbsp;-&nbsp;GT  
Read:&nbsp;&nbsp;ACGCA-&nbsp;TGCAGTtagacgt  
Cigar: 5M 1D 2M 2I 2M 7S  

Here we start with 5 alignment matches and mismatches, followed by a deletion of one base. Then we have two more alignment matches, an insertion of 2 bases and two more matches. At the end, we have a soft clipping of 7 bases, 7S. These are clipped sequences that are present in the read but do not match the reference.
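
One way to check your reading of a CIGAR string is to add up how many reference bases and how many read bases it covers. In the SAM specification, M, D, N, = and X consume the reference, while M, I, S, = and X consume the read. The helper below is only a sketch (the name `cigar_lengths` is made up) and expects a CIGAR string written without spaces.

```shell
# Print two numbers for a CIGAR string: the number of reference bases it
# spans and the number of read bases it accounts for in SEQ.
cigar_lengths() {
  awk -v cigar="$1" 'BEGIN {
    ref = 0; query = 0
    # Repeatedly peel one "<length><operation>" pair off the front.
    while (match(cigar, /^[0-9]+[MIDNSHPX=]/)) {
      n  = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      if (op ~ /[MDNX=]/) ref   += n   # operations that consume the reference
      if (op ~ /[MISX=]/) query += n   # operations that consume the read
      cigar = substr(cigar, RLENGTH + 1)
    }
    print ref, query
  }'
}

cigar_lengths 4M4D8M         # Example 1
cigar_lengths 5M1D2M2I2M7S   # Example 2
```

For Example 1 this gives 16 reference bases and 12 read bases, matching the 16-base reference and the 12 bases present in the read.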

### Exercises
__Q10: What does the CIGAR from Q8 mean?__

__Q11: How would you represent the following alignment with a CIGAR string?__  
 
Ref:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ACGT-&nbsp;-&nbsp;-&nbsp;-&nbsp;ACGTACGT  
Read:&nbsp;&nbsp;ACGTACGTACGTACGT  

### Flags
Column 2 of the alignment contains a combination of bitwise FLAGs providing detailed information about the alignment. The following table details the meaning of each flag
   
|Hex  |Dec |Flag         |Description                                           |
|---  |--- |---          |---                                                   |
|0x1  |1   |PAIRED       |paired-end (or multiple-segment) sequencing technology|
|0x2  |2   |PROPER_PAIR  |each segment properly aligned according to the aligner|
|0x4  |4   |UNMAP        |segment unmapped                                      |
|0x8  |8   |MUNMAP       |next segment in the template unmapped                 |
|0x10 |16  |REVERSE      |SEQ is reverse complemented                           |
|0x20 |32  |MREVERSE     |SEQ of the next segment in the template is reversed   |
|0x40 |64  |READ1        |the first segment in the template                     |
|0x80 |128 |READ2        |the last segment in the template                      |
|0x100|256 |SECONDARY    |secondary alignment                                   |
|0x200|512 |QCFAIL       |not passing quality controls                          |
|0x400|1024|DUP          |PCR or optical duplicate                              |
|0x800|2048|SUPPLEMENTARY|supplementary alignment                               |

For example, if an alignment has FLAG set to 113, the only combination of flag values that sums to 113 is `64 + 32 + 16 + 1`. So we know that these four flags apply: the read is paired, it is reverse complemented, the sequence of the next read in the fragment is reversed, and the aligned read is the first read in the template.
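
The decomposition can also be done mechanically by testing each bit in turn. The following is a small sketch using shell arithmetic. The function name `decode_flag` is made up, and the flag names are those from the table above.

```shell
# Print the names of all flags that are set in a decimal FLAG value.
decode_flag() (
  flag=$1
  # Flag names in bit order: 1, 2, 4, 8, ... 2048
  names="PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY"
  bit=1
  out=""
  for name in $names; do
    # A bitwise AND is non-zero only if this bit is set in the flag.
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$(( bit * 2 ))
  done
  printf '%s\n' "$out"
)

decode_flag 113
```

For 113 this prints `PAIRED,REVERSE,MREVERSE,READ1`, the same four flags found by hand.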

#### Primary, secondary and supplementary alignments
A read that aligns to a single position in a reference (including insertions, deletions, skips and clipping but not direction changes), is a __linear alignment__. If a read cannot be represented as a linear alignment, but instead is represented as a group of linear alignments without large overlaps, it is called a __chimeric alignment__. These can for instance be caused by structural variations. Usually, one of the linear alignments in a chimeric alignment is considered to be the __representative__ alignment, and the others are called __supplementary__.

Sometimes a read maps equally well to more than one location. In these cases, one of the possible alignments is marked as the __primary__ alignment and the rest are marked as __secondary__ alignments.

### BAM
BAM (Binary Alignment/Map) format is a compressed binary version of SAM. This means that, while SAM is human readable, BAM is only readable by computers. BAM files can be viewed using samtools, which displays them in the same format as a SAM file. The key features of BAM are:

* Stores alignments from most mapping tools
* Supports multiple sequencing technologies
* Supports indexing for quick retrieval/viewing of alignments
* Compact size (e.g. 112Gbp Illumina = 116GB disk space)
* Reads can be grouped into logical groups e.g. lanes, libraries, samples
* Widely supported by variant calling packages and genome viewers

### Exercises
Since BAM is a binary format, we can't use the standard Linux operations (cat, less, head, grep etc.) directly on this format. `Samtools` is a set of programs for interacting with SAM and BAM files. Using the __samtools view__ command, print the header of the BAM file:

In [None]:
samtools view -H NA20538.bam

__Q12: What version of the human assembly was used to perform the alignments? (Hint: Can you spot this somewhere in the @SQ records?)__

__Q13: How many sequencing runs/lanes are in this BAM file? (Hint: Do you recall what RG represents?)__

__Q14: What programs were used to create this BAM file? (Hint: have a look for the program record, @PG)__

__Q15: What version of bwa was used to align the reads? (Hint: is there anything in the @PG record that looks like it could be a version tag?)__  

Running __samtools view__ on a BAM file without any options will produce SAM format without the header information. This is printed to the STDOUT in the terminal (screen). Let's have a look at the first read of the BAM file:

In [None]:
samtools view NA20538.bam | head -n 1

Note we only want to look at the first line of the alignment section of the BAM file so we have piped the output of `samtools view` to the `head` command. 

__Q16: What is the name of the first read? (Hint: have a look at the [alignment section](formats.ipynb#Alignment-Section) if you can't recall the different fields)__

__Q17: What position does the alignment start at?__

## CRAM
Even though BAM files are compressed, they are still very large. Typically they use 1.5-2 bytes for each base pair of sequencing data that they contain, and while disk capacity is ever improving, increases in disk capacity are being far outstripped by sequencing technologies.

![Growth of DNA sequencing](img/compression_cram.png)

BAM stores all the data for a sequence read, this includes every base call and every base quality, and it uses a single compression technique for all types of data (numbers, characters etc.). Therefore, CRAM was designed to provide a way to store the same information as BAM but using less disk space. CRAM uses three important concepts:

* Reference based compression
* Controlled loss of quality information
* Different compression methods to suit the type of data, e.g. base qualities vs. metadata vs. extra tags

The figure below displays how reference-based compression works. Instead of storing all the bases of all the reads, only the nucleotides that differ from the reference, and their positions, are kept.

![CRAM1](img/CRAM_format.png)

![CRAM2](img/CRAM_format2.png)

This means that the same information from a BAM file can be stored in CRAM file but using a fraction of the disk space.
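
To make the idea of reference-based compression concrete, here is a toy sketch that stores a read as its start position plus a list of differences from the reference. It only illustrates the principle: the sequences are made up, `diff_to_reference` is a hypothetical helper, only substitutions are handled, and the real CRAM encoding is considerably more involved.

```shell
# Given a reference sequence, a 1-based start position and a read aligned
# there without gaps, print only the differences as "position:base".
diff_to_reference() {
  awk -v ref="$1" -v pos="$2" -v seq="$3" 'BEGIN {
    for (i = 1; i <= length(seq); i++) {
      refbase  = substr(ref, pos + i - 1, 1)
      readbase = substr(seq, i, 1)
      if (refbase != readbase) printf "%d:%s\n", pos + i - 1, readbase
    }
  }'
}

# An 8-base read starting at reference position 3, with one mismatch.
diff_to_reference ACGTACGTACGT 3 GTACCTAC
```

Instead of all eight bases, only the start position, the read length and the single difference at position 7 would need to be stored to reconstruct this read.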

## Sorting and Indexing
The reads in a BAM and CRAM file can be ordered or sorted in one of two ways:

* sorted by name, meaning the reads are ordered based on the fragment name so reads from the same fragment (read pairs) will appear next to each other in the file
* coordinate-sorted, meaning the reads that align to the leftmost position or start of the genome appear first in the file. 

To allow for fast random access of regions in BAM and CRAM files, they can be indexed. The files must first be coordinate-sorted. This can be done using __samtools sort__. If no options are supplied, it will by default sort by the left-most position.

In [None]:
samtools sort -o NA20538_sorted.bam NA20538.bam

Now we can use __samtools index__ to create an index file (.bai) for our sorted BAM file:

In [None]:
samtools index NA20538_sorted.bam

You can think of an index file as a lookup table that tools like samtools can use to quickly retrieve a segment of the file without having to read the entire file into memory. For example, to look for reads mapped to a specific region, we can use __samtools view__ and specify the region we are interested in as: RNAME[:STARTPOS[-ENDPOS]]. 

If we wanted to look at all the reads mapped to chromosome 1, we could use:

In [None]:
samtools view NA20538_sorted.bam 1 | head -10

To look at the region on chromosome 1 beginning at position 25,000,000 and ending at the end of the chromosome, we can do:

In [None]:
samtools view NA20538_sorted.bam 1:25000000

And to explore the 1001bp long region on chromosome 1 beginning at position 20,000,000 and ending at position 20,001,000, we can use:

In [None]:
samtools view NA20538_sorted.bam 1:20000000-20001000 | tail -10

### Exercises 
__Q18: How many reads are mapped to region 20025000-20030000 on chromosome 1?__

## VCF/BCF
The VCF format is a standard format for storing sequence variation data. The BCF format is the compressed binary version of VCF. Remember that a compressed binary file is not human readable.

VCF is a text based tab-delimited file that is parsable by standard Linux commands. It is composed of two parts, the VCF header and the body. The figure below provides an overview of the different components of a VCF file:

![VCF format](img/VCF1.png)

### VCF header
Header lines are denoted with `##` and provide metadata about the file (e.g. fileformat, fileDate and reference) and metadata defining the fields used in the body of the file (e.g. INFO, FILTER, and FORMAT). These header lines consist of key=value pairs and can consist of multiple pairs enclosed by `<>`. More information about these fields is available in the [VCF specification](http://samtools.github.io/hts-specs/VCFv4.3.pdf).

All header lines are optional and can be put in any order, except for _fileformat_. This holds the information about which version of VCF is used and must come first in the file. 

### VCF body
The body of the VCF follows the header, and is tab separated into 8 mandatory columns and an unlimited number of optional columns that may be used to record other information about the sample(s). 

#### Header line
The header line starts with `#` and contains the names of the columns used in the file:

1. CHROM: an identifier from the reference genome
2. POS: the reference position
3. ID: a list of unique identifiers (where available)
4. REF: the reference base(s)
5. ALT: the alternate base(s)
6. QUAL: a phred-scaled quality score
7. FILTER: filter status
8. INFO: additional information

If the file contains genotype data, additional fields are included in the file. These are a FORMAT column, and then a number of sample IDs. The FORMAT field defines the data types and order of the information for each sample. Some examples of these data types are:

* GT: Genotype, encoded as allele values separated by either of / or |
* DP: Read depth at this position for this sample
* GQ: Conditional genotype quality, encoded as a phred quality

#### Positions
Following the header line is a series of rows containing information about a position in the genome along with genotype information at that position for each of the samples.

Let's look at a specific example:

![VCF Example Line](img/vcf_example.png)

The location is position 3 on chromosome 1, and the variant site does not have an identifier in a standard database. The reference base at this position is A, and the only alternative allele called across all samples is G. There is no quality score for the site, and it passes the quality filters that have been applied. The values of AC, AN and DP for this site across all samples are AC=67, AN=5400 and DP=2809. The definitions of AC, AN and DP can be found in the INFO lines of the VCF header: AC is the allele count, AN is the allele number and DP is the read depth. The remaining columns provide information about the site in each sample. First, the FORMAT column describes the information listed for each sample: GT, PL, DP and GQ. Again, these are defined in the VCF header, as genotype, genotype likelihoods, read depth and genotype quality. This means that for SAMPLE1 the genotype is 1/1 (G/G), the genotype likelihoods are 0,9,73, the depth at this position in this sample is 26 and the genotype quality is 22. Similarly, for SAMPLE2 the genotype is 0/0 (A/A), the genotype likelihoods are 0,9,73, the depth at this position in this sample is 13 and the genotype quality is 22.
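
The way the FORMAT column lines up with each sample column can be sketched with awk: both are colon-separated lists, and the n-th key in FORMAT describes the n-th value in the sample column. The data line below is invented for illustration (it is not the line shown in the figure), and `sample_fields` is a made-up helper name.

```shell
# A made-up VCF data line with one sample column (column 10).
vcf_line=$(printf '1\t3\t.\tA\tG\t.\tPASS\tAC=67;AN=5400;DP=2809\tGT:PL:DP:GQ\t0/1:35,0,40:18:35')

# Pair each key in the FORMAT column (column 9) with the matching value
# in the chosen sample column and print them as key=value lines.
sample_fields() {
  printf '%s\n' "$1" | awk -F'\t' -v col="$2" '{
    n = split($9, keys, ":")
    split($col, vals, ":")
    for (i = 1; i <= n; i++) printf "%s=%s\n", keys[i], vals[i]
  }'
}

sample_fields "$vcf_line" 10
```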

### BCF
VCF files can be compressed (for example with gzip), but even compressed they can still be very large. For example, a compressed VCF with 3781 samples of human data will be 54 GB for chromosome 1, and 680 GB for the whole genome. 

VCFs can also be slow to parse, as text conversion is slow. The main bottleneck is the "FORMAT" fields. For this reason the BCF format, a binary representation of VCF, was developed. In BCF files the fields are rearranged for better compression and fast access. The following images show the process of converting a VCF file into a BCF file. 

![VCF2](img/VCF2.png)

![VCF3](img/VCF3.png)

`Bcftools` comprises a set of programs for interacting with VCF and BCF files. It can be used to view or extract records from a region and to convert between VCF and BCF formats.

#### bcftools view  
Let's have a look at the header of the file 1kg.bcf in the data directory. Note that `bcftools` uses __`-h`__ to print only the header, while samtools uses __`-H`__ for this. 

In [None]:
bcftools view -h 1kg.bcf

Similarly to BAM, BCF supports random access, that is, fast retrieval from a given region. For this, the file must be indexed:

In [None]:
bcftools index 1kg.bcf

Now we can extract all records from the region 20:24042765-24043073, using the __`-r`__ option. The __`-H`__ option will make sure we don't include the header in the output:

In [None]:
bcftools view -H -r 20:24042765-24043073 1kg.bcf

#### bcftools query  
The versatile __bcftools query__ command can be used to extract any VCF field. Combined with standard Linux commands, this gives a powerful tool for quick querying of VCFs. Have a look at the usage options:

In [None]:
bcftools query -h

Let's try out some useful options. As you can see from the usage, __`-l`__ will print a list of all the samples in the file. Give this a go:

In [None]:
bcftools query -l 1kg.bcf

Another useful option is __`-s`__, which allows you to extract all the data relating to a particular sample. This option is also accepted by __bcftools view__, which we use here for sample HG00131:

In [None]:
bcftools view -s HG00131 1kg.bcf | head -n 50

The format option, __`-f`__ can be used to select what gets printed from your query command. For example, the following will print the position, reference base and alternate base for sample HG00131, separated by tabs:

In [None]:
bcftools query -f'%POS\t%REF\t%ALT\n' -s HG00131 1kg.bcf | head

### Exercises
Now, try and answer the following questions about the file 1kg.bcf in the data directory. For more information about the different usage options you can open the [bcftools query manual page](http://samtools.github.io/bcftools/bcftools.html#query) in a new tab.

__Q19: What version of the human assembly do the coordinates refer to?__  

__Q20: How many samples are there in the BCF?__

__Q21: What is the genotype of the sample HG00107 at the position 20:24019472? (Hint: use the combination of -r, -s, and -f options)__

## Summary

The figure below summarises the data formats we have looked at so far.

![Data formats summary](img/formats_summary.png)

## GFF

One final format worth mentioning is the GFF format. The general feature format (also known as the gene-finding format or generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences. The format consists of one line per feature, each containing 9 columns of data, separated by tabs. 

Let's look at an example:

```
1	Prodigal	CDS	210	1422	.	-	0	ID=s01
1	Prodigal	CDS	508	2464	.	-	0	ID=s02;product=hypothetical protein
1	Prodigal	CDS	967	3525	.	-	0	ID=s03;Name=rfuC;db_xref=COG:COG403
```

We can see that each line has nine fields, all separated by the tab character:

1. __seqname__ - The name of the sequence where the feature is located. 
2. __source__ - The algorithm or procedure that generated the feature. This is typically the name of a software or database.
3. __feature__ - The feature type name, like "gene", "exon" or "cds". 
4. __start__ - The start position of the feature, with sequence numbering starting at 1.
5. __end__ - The end position of the feature, with sequence numbering starting at 1.
6. __score__ - A numeric value that generally indicates the confidence of the source in the annotated feature. A value of "." (a dot) is used to define a null value.
7. __strand__ - Single character that indicates the strand of the feature. This can be "+" (positive, or 5'->3'), "-", (negative, or 3'->5'), "." (undetermined), or "?" for features with relevant but unknown strands.
8. __frame__ - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
9. __attribute__ - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
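
Since GFF is plain tab-separated text, simple summaries can be produced with awk. As a sketch, the code below prints the ID and length of each feature in the example above. Because start and end are both 1-based and inclusive, the length is end - start + 1. The helper names `gff_example` and `gff_lengths` are made up for this illustration.

```shell
# The three example features from above, written with explicit tabs.
gff_example() {
  printf '1\tProdigal\tCDS\t210\t1422\t.\t-\t0\tID=s01\n'
  printf '1\tProdigal\tCDS\t508\t2464\t.\t-\t0\tID=s02;product=hypothetical protein\n'
  printf '1\tProdigal\tCDS\t967\t3525\t.\t-\t0\tID=s03;Name=rfuC;db_xref=COG:COG403\n'
}

# For every nine-column feature line, print its ID attribute and its length.
gff_lengths() {
  awk -F'\t' '!/^#/ && NF == 9 {
    id = "."
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
      if (attrs[i] ~ /^ID=/) { id = substr(attrs[i], 4) }
    }
    print id "\t" ($5 - $4 + 1)
  }' "$@"
}

gff_example | gff_lengths
```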

Congratulations, you have reached the end of the data formats tutorial!

If you have time then continue to the additional (optional) section of the tutorial: [Converting between formats](conversion.ipynb).   