# NGS Data Formats

## Introduction
In this tutorial we will introduce several common data formats used for sequence data. We will cover the following formats:

__FASTA__ - This format is used to store nucleotide sequences  
__FASTQ__ - This format is used to store nucleotide sequences and corresponding quality scores  
__SAM/BAM__ - This format is used to store unaligned or aligned (matched to a reference genome) nucleotide sequences  
__CRAM__ - This format is similar to BAM but achieves better compression  
__VCF/BCF__ - This format is used to store sequence variation (SNPs, indels, structural variations)  
__GFF__ - This format is used to store sequence feature information (genes, repeats, tRNAs)

## Learning outcomes
On completion of the tutorial, you can expect to be able to:

* Describe the different data formats used for sequence data (FASTA, FASTQ, SAM/BAM, CRAM, VCF/BCF, GFF)
* Perform conversions between the different data formats

## Tutorial sections
This tutorial comprises the following sections:   
 1. [Data formats for NGS](formats.ipynb)  
 2. [Converting between formats](conversion.ipynb)
  
## Authors and License
This tutorial was written by [Jacqui Keane](https://github.com/jacquikeane) and [Sara Sjunnebo](https://github.com/ssjunnebo) based on material from [Petr Danecek](https://github.com/pd3) and [Thomas Keane](https://github.com/tk2).

The content is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

## Running the commands in this tutorial
You can follow this tutorial by typing all the commands you see in a terminal window on your computer. Remember, the terminal window is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, open a terminal window, type the command below, and press the `Enter` key:

In [None]:
cd ~/course_data/data_formats/data

## Prerequisites
This tutorial assumes that you have the following software and its dependencies installed on your computer. The software may be updated from time to time, so we have also listed the version that was used when writing this tutorial.


| Package name | Link for download/installation instructions                          | Version |
| :----------: | :------------------------------------------------------------------: |:------: |
| samtools     | https://github.com/samtools/samtools                                 | 1.17    |
| bcftools     | https://github.com/samtools/bcftools                                 | 1.17    |
| picard-slim  | https://broadinstitute.github.io/picard/                             | 3.0.0   |

The easiest way to install the required software is with `conda`, a software package manager. The software has already been installed on this computer for you. To activate it, type:

In [None]:
conda activate formats

After the software is activated, type the following commands:

In [None]:
samtools --help

In [None]:
bcftools --help

In [None]:
picard -h

Each command should print the help message for the corresponding tool.
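If one of the commands fails instead, a quick check of whether each tool is on your `PATH` can help narrow down the problem. The following is a minimal sketch, not part of the original tutorial: the function name `check_tools` is hypothetical, and it assumes the executables are called `samtools`, `bcftools` and `picard`, as in the commands above.

```shell
# check_tools is a hypothetical helper: it reports whether each named
# command can be found on the PATH and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: NOT found"
            missing=1
        fi
    done
    return "$missing"
}

# Check the three tools used in this tutorial; print a hint if any is missing.
check_tools samtools bcftools picard || echo "Some tools are missing; try: conda activate formats"
```

For tools that are found, `samtools --version` and `bcftools --version` print the installed version, which you can compare with the table above.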

To get started with the tutorial, go to the first section: [Data formats for NGS](formats.ipynb)