## Genome annotation with Prokka

Now that we have some extra information about our bins, we can continue to analyse the high-quality bins. The final CheckM results give a good overview of which bins have low contamination and high completeness, and show the lowest taxonomic rank assigned to each bin. Pick a bin that you think is interesting to study further. Alternatively, you can write a loop to annotate multiple bins.

With the selected bin(s), we are going to do some genome annotation. Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labelling them with useful information. Prokka is a software tool that annotates bacterial, archaeal and viral genomes quickly and produces standards-compliant output files.

<b>Assignment:</b><br>
Have a look at the prokka help page below and come up with the right command to run prokka. (HINT: look at the Usage section to run prokka in default mode, think about the number of threads, use `--cpus 1 --centre X --compliant` to stop Prokka's complaints about ugly contig names, and direct the output towards data/prokka.)

Make a directory for the prokka output:

In [None]:
mkdir 

Prokka help page:

In [None]:
prokka -h

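Run prokka. This may take a while. Note that prokka refuses to write into an output folder that already exists unless you pass `--force`, so point `--outdir` at a new subfolder.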

In [None]:
prokka ..path/to/bin/ --outdir ..output/path/..  --cpus 1 --centre X --compliant
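
If you want to annotate several bins in one go, a loop along the following lines could work. This is only a sketch: the function name `annotate_bins`, the `PROKKA_CMD` override, the directory arguments and the `.fa` extension are assumptions for illustration, so adjust them to your own data. Each bin gets its own new output subfolder named after the bin.

```shell
# Sketch: run prokka on every bin FASTA in a directory.
# Assumptions: bins end in .fa; directory names are supplied by you.
annotate_bins() {
    local bins_dir="$1" out_root="$2"
    # PROKKA_CMD is a hypothetical override, e.g. "echo prokka" for a dry run.
    local prokka_cmd="${PROKKA_CMD:-prokka}"
    local bin name
    for bin in "$bins_dir"/*.fa; do
        [ -e "$bin" ] || continue          # skip when the glob matches nothing
        name=$(basename "$bin" .fa)        # bin file name without extension
        $prokka_cmd "$bin" --outdir "$out_root/$name" --prefix "$name" \
            --cpus 1 --centre X --compliant
    done
}
```

Setting `PROKKA_CMD="echo prokka"` before calling the function prints the commands instead of running them, which is a cheap way to check the paths first.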


Think about and discuss these questions: <br>

What did you do? <br>
Where is your output? <br>
What does your output look like? <br>
Do you know what all the output files are or mean? <br>

(HINT: you can look at the prokka files in the same way we looked at the reads and scaffolds earlier.)

<b>Investigating the prokka output</b><br>
The prokka output is very elaborate and can be used for many purposes. For this practical, we will only visualise it quickly. To investigate the prokka output, you can use two web servers, both of which can place the prokka annotations in KEGG metabolic pathways.<br>

(1) Prokka reports UniProt IDs in the `.gff` file. First, we will collect these IDs.<br>
(2) Since KEGG works with **KO IDs**, we are going to convert the UniProt IDs from prokka to KO IDs using this website:<br>
http://www.uniprot.org/uploadlists/


(3) Make sure to download the 'mapping table', which is the input for KEGG. You can upload the 'target list' on the pathways.embl.de web page. On these web pages (links below), you can investigate the metabolism of your specific bin (in an ideal world: your specific microbe) by checking which genes form complete pathways.
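
To turn the downloaded mapping table into a plain list of KO IDs, something like the sketch below may help. It assumes the table is tab-separated with one header row and the KO ID in the second column; check your own download, since the layout is an assumption here. The function name `ko_ids_from_mapping` and the file name in the usage line are hypothetical.

```shell
# Sketch: pull the unique KO IDs out of a UniProt-to-KO mapping table.
# Assumption: tab-separated, one header line, KO ID in column 2.
ko_ids_from_mapping() {
    local table="$1"
    # Skip the header, take column 2, drop empty lines, keep unique IDs.
    tail -n +2 "$table" | cut -f2 | grep -v '^$' | sort -u
}

# Usage (hypothetical file names):
# ko_ids_from_mapping mapping_table.tsv > listofKOIDs
```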

You may feel overwhelmed by the number of pathways, modules and genes available to you. For this specific case, we are interested in nitrogen metabolism. You may want to have a look at the Dijkhuizen et al. 2018 paper on Azolla endophytes. Figure 4 of that paper plots the nitrogen metabolism of multiple microbes. From it, some hypotheses were derived and tested in the wet lab. Does your plot of the nitrogen metabolism overlap with the published one? Or did you maybe discover a new endophyte?

http://www.genome.jp/kegg/tool/map_module.html<br>
https://pathways.embl.de/ipath3.cgi?map=metabolic<br>

View the prokka output (look for the `.gff` file):

In [None]:
grep -v '^#' ....yourprokkaoutput.gff | head
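
To get a feel for what the `.gff` file contains, you could count how often each feature type (the third column) occurs. The sketch below assumes a standard nine-column, tab-separated GFF3 file with the sequences appended after a `##FASTA` line; the function name `count_gff_features` is hypothetical.

```shell
# Sketch: count feature types (column 3) in a GFF file.
count_gff_features() {
    local gff="$1"
    # Stop at the ##FASTA marker, skip comment lines, keep 9-column rows.
    awk -F'\t' '/^##FASTA/ {exit} !/^#/ && NF == 9 {print $3}' "$gff" \
        | sort | uniq -c
}

# Usage: count_gff_features ....yourprokkaoutput.gff
```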

Take the UniProt IDs out of the `.gff` file:

In [None]:
grep -o 'UniProt.*' ....yourprokkaoutput | cut -d';' -f1 | cut -d':' -f2 > listofuniprotIDs
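
The same protein can be hit by several genes, so the list may contain duplicates. A variant that removes them could look like the sketch below. It assumes the IDs appear in the attribute column as `UniProtKB:<ID>`, ending at a `;` or `,`; the function name `extract_uniprot_ids` is hypothetical.

```shell
# Sketch: extract unique UniProt IDs from a prokka GFF file.
# Assumption: IDs are written as "UniProtKB:<ID>" in the attributes.
extract_uniprot_ids() {
    local gff="$1"
    # Match the tag plus ID, strip the tag, then sort and deduplicate.
    grep -o 'UniProtKB:[^;,]*' "$gff" | cut -d':' -f2 | sort -u
}

# Usage: extract_uniprot_ids ....yourprokkaoutput.gff > listofuniprotIDs
```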