# Preparing the input data for Roary

`Roary` requires annotated genome assemblies in GFF3 format as input. 

A __genome assembly__ is an attempted reconstruction of the complete DNA sequence of an organism's genome from the short DNA fragments generated by high-throughput sequencing technologies.

An __annotated genome assembly__ is a genome assembly that has been further analyzed to identify the various functional elements present in the DNA sequence. It contains information about genes, regulatory elements, non-coding regions, repetitive elements, and other biologically relevant features.

For this tutorial, we have prepared the genome assemblies of three _Streptococcus pneumoniae_ isolates using sequence data obtained from one of the public sequence archives, the ENA (European Nucleotide Archive). The three assembled *S. pneumoniae* genomes are located in a directory called "assemblies".

In [None]:
ls assemblies

The sequence data and assemblies can also be downloaded from the ENA. Their accession numbers are listed below for reference.

|Name    |Genome Accession |Data Accession|
|--------|---------        |--------------|
|sample1 |GCA_900194945.1  |ERR657006     |
|sample2 |GCA_900194155.1  |ERR657305     |
|sample3 |GCA_900194195.1  |ERR657310     |

## Annotating genome assemblies
We must now annotate our genome assemblies to produce GFF3 files for `Roary`. The GFF3 files must include the nucleotide sequence at the end of the file. To make it easier to identify which isolate each gene came from, the genes in each GFF3 file should also carry a locus tag (identifier) prefix that is unique to that file. `Prokka` is a tool that performs genome annotation. All GFF3 files created by `Prokka` are valid input for `Roary`, so this is the recommended way of generating the input files.

To run `Prokka` on a single file using the default settings, you would use something like:

    prokka sample1.fasta

If you have many assemblies, running this for each sample will soon become tedious. Instead, we will use a for-loop to run `Prokka` on all the fasta files in the assemblies directory. 

In [None]:
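# Annotate every FASTA file in the assemblies directory
for F in assemblies/*.fasta; do
    FILE="${F##*/}"            # strip the directory, e.g. sample1.fasta
    PREFIX="${FILE/.fasta/}"   # strip the extension, e.g. sample1
    prokka --locustag "$PREFIX" --outdir "annotated_$PREFIX" --prefix "$PREFIX" "$F"
done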

For each FASTA file in the assemblies directory, the loop sets three variables. `F` is the path to the FASTA file (e.g. `assemblies/sample1.fasta`), `FILE` is the filename without the leading `assemblies/` (e.g. `sample1.fasta`), and `PREFIX` is the text found before `.fasta` (e.g. `sample1`). We have also used the following options for Prokka:

|Option       |Description                       |
|-------      |-----------                       |
|`--locustag` |The locus tag prefix              |
|`--outdir`   |The directory to put the output   |
|`--prefix`   |The prefix for the output files   |

By providing a unique value for the `--locustag` option we make it easier to identify which sample different genes came from when we look at the results from Roary. The `--outdir` and `--prefix` options will make it easier for us to keep track of our files. 
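
The filename handling in the loop can be tried on its own before running Prokka. The following is a minimal sketch that uses a hypothetical path, `assemblies/sample1.fasta`, purely for illustration. It uses the suffix-removal form `${FILE%.fasta}`, which gives the same result as `${FILE/.fasta/}` for these filenames and should also work in shells other than bash.

```shell
# Hypothetical path, used only to illustrate the parameter expansions
F="assemblies/sample1.fasta"

# Remove the longest leading match of '*/', leaving only the filename
FILE="${F##*/}"

# Remove the trailing '.fasta', leaving the sample name
PREFIX="${FILE%.fasta}"

echo "F=$F"
echo "FILE=$FILE"
echo "PREFIX=$PREFIX"
```

If `PREFIX` prints as the bare sample name, the same expansions will give Prokka a distinct locus tag prefix and output directory for each assembly.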

This will take around 5 minutes or longer to run, so be patient. You may want to read the next section, [Performing QC on your data](qc.ipynb), and come back here when Prokka has finished running.

Once this has finished, you should have three new directories called `annotated_sample1`, `annotated_sample2` and `annotated_sample3`. Have a look to see that it worked:

In [None]:
ls -l

In [None]:
ls -l annotated_sample1

As you can see, for sample1 we now have a number of annotation files. There is more information about the different output files, along with information about other usage options, on the [Prokka GitHub page](https://github.com/tseemann/prokka). For now, we are only interested in the GFF files that were generated, as these are what we are going to use as input for Roary.
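
Because Roary needs the nucleotide sequence at the end of each GFF3 file, it can be worth confirming that the sequence section is present. The sketch below assumes the `annotated_*` directory layout produced by the loop above and relies on the `##FASTA` line that marks the start of the embedded sequence in a GFF3 file. The helper name `has_fasta_section` is made up for this example and is not part of Prokka or Roary.

```shell
# Hypothetical helper: succeeds if the given GFF3 file has a ##FASTA line
has_fasta_section() {
    grep -q '^##FASTA' "$1"
}

# Check every GFF file in the annotated_* directories
for GFF in annotated_*/*.gff; do
    # Skip the unexpanded pattern if no GFF files exist yet
    [ -e "$GFF" ] || continue
    if has_fasta_section "$GFF"; then
        echo "$GFF: sequence section found"
    else
        echo "$GFF: no ##FASTA section"
    fi
done
```

Any file reported without a `##FASTA` section would need to be regenerated before it is passed to Roary.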


## Check your understanding
**Q3: Why do we need to run Prokka?**  
a) It will perform QC on our data  
b) It will annotate our data  
c) We don't, Roary can handle fasta files as input  
  
**Q4: Why do we use the --locustag option when we run Prokka?**  
a) To make it easier to keep track of the output files  
b) Because Roary won't work without it  
c) To make the Roary results easier to interpret  

Now continue to the next section of the tutorial: [Performing QC on your data](qc.ipynb).