In [None]:
%%bash
for i in ./metagenomics/mapping/*.bam; do
    echo "$i.sorted"
    samtools sort -o "$i.sorted" "$i"
done

remove the samtools view directory:

In [None]:
%%bash
pwd
du -h ./metagenomics/

In [None]:
!rm -rf ....yourdirectoryname

samtools help page:

In [None]:
!samtools view

run samtools view to view your BAM files:

In [None]:
!samtools view ....yourbamsortedbam | head

<h2> Binning with MetaBAT </h2>

Now that we have the reads, the scaffolds, and the mapping of the reads onto the scaffolds, we can continue with binning. In metagenomics, binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on compositional features, on alignment (similarity), or on both. MetaBAT uses both contig depth and tetranucleotide frequencies to bin the contigs. Each bin represents one operational taxonomic unit found in the metagenome.
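
To make the idea of tetranucleotide frequencies concrete, here is a minimal Python sketch that counts all 4-mers in a sequence and converts the counts to relative frequencies. It is an illustration only, not MetaBAT's implementation: the function name and the toy contig are assumptions made for this example, and real binners often also merge each 4-mer with its reverse complement, which this sketch does not do.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of every possible 4-mer in a DNA sequence."""
    sequence = sequence.upper()
    k = 4
    # Slide a window of length 4 over the sequence and count each 4-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only report 4-mers made of unambiguous bases (A, C, G, T).
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


# Hypothetical toy contig, not taken from the course data.
toy_contig = "ACGTACGTACGT"
freqs = tetranucleotide_frequencies(toy_contig)
print(len(freqs))     # number of possible 4-mers
print(freqs["ACGT"])  # relative frequency of ACGT in the toy contig
```

Contigs from the same genome tend to have similar frequency profiles, which is why such a profile can be combined with depth to group contigs.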

<b>Assignment:</b><br>
Run the script provided in /metagenomics/scripts/jgi_summarize_bam_contig_depths to calculate the average depth per contig. <br>
Then run metabat to bin the contigs. Don't forget to include the previously calculated contig depths in the metabat command.
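
As a rough picture of what "average depth per contig" means, the Python sketch below divides the number of aligned bases on each contig by the contig length. This is a simplified assumption for illustration: the provided script may apply additional filtering, and the function name, contig names, and alignment values are all hypothetical.

```python
def mean_depth_per_contig(contig_lengths, alignments):
    """Average depth = aligned bases on a contig / contig length."""
    aligned_bases = {name: 0 for name in contig_lengths}
    for contig, aligned_length in alignments:
        aligned_bases[contig] += aligned_length
    return {
        name: aligned_bases[name] / length
        for name, length in contig_lengths.items()
    }


# Hypothetical contigs and read alignments given as (contig name, aligned bases).
contig_lengths = {"contig_1": 1000, "contig_2": 500}
alignments = [("contig_1", 100)] * 30 + [("contig_2", 100)] * 5
depths = mean_depth_per_contig(contig_lengths, alignments)
print(depths)
```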

In [None]:
!./metagenomics/scripts/jgi_summarize_bam_contig_depths -h

calculate the contig depth per scaffold<br>
hint: a glob such as directoryname/* matches all files in that directory
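
To see what a glob matches without touching the course data, the following Python sketch creates a temporary directory with a few empty, hypothetical files and expands two patterns with the standard `glob` module. The file names are made up for this example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty, hypothetical files.
    for name in ["a.bam", "b.bam", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    # "*" matches every (non-hidden) file in the directory.
    all_files = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*"))
    )
    # "*.bam" matches only the files ending in .bam.
    bam_files = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.bam"))
    )

print(all_files)
print(bam_files)
```

The shell expands globs in the same way before the command runs, so a command that is given directoryname/* receives every file in that directory as a separate argument.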

In [None]:
!./metagenomics/scripts/jgi_summarize_bam_contig_depths --outputDepth 

metabat help page:

In [None]:
!metabat -h

make a new directory:

In [None]:
!mkdir

run metabat:

In [None]:
!metabat -i -o -t

<h2> Checking completeness and contamination using CheckM </h2>

Now that we have the bins made with MetaBAT, we can check them for contamination and completeness (quality). For this we will use CheckM. CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage (also called marker/signature genes). http://ecogenomics.github.io/CheckM/
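
The intuition behind single-copy marker genes can be sketched in a few lines of Python: a marker that is missing lowers completeness, and a marker that appears more than once suggests contamination. This is a deliberately simplified assumption for illustration. CheckM itself works with lineage-specific, collocated marker sets rather than treating each marker independently, and the function name and marker counts below are hypothetical.

```python
def estimate_quality(marker_counts):
    """Estimate completeness and contamination (in %) from marker copy numbers."""
    total = len(marker_counts)
    # A marker found at least once counts towards completeness.
    present = sum(1 for count in marker_counts.values() if count >= 1)
    # Every copy beyond the first counts towards contamination.
    extra_copies = sum(count - 1 for count in marker_counts.values() if count > 1)
    completeness = 100.0 * present / total
    contamination = 100.0 * extra_copies / total
    return completeness, contamination


# Hypothetical copy numbers of five single-copy marker genes in one bin.
toy_bin = {"marker_A": 1, "marker_B": 1, "marker_C": 0, "marker_D": 2, "marker_E": 1}
completeness, contamination = estimate_quality(toy_bin)
print(completeness, contamination)
```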

As you will see in the checkm help pages, checkm has a workflow (lineage_wf) that runs all necessary steps to assess bin quality.

Lineage_wf (lineage-specific workflow) steps: <br>
- The tree command places genome bins into a reference genome tree. <br>
- The lineage_set command creates a marker file indicating lineage-specific marker sets suitable for evaluating each genome. <br>
- This marker file is passed to the analyze command in order to identify marker genes and estimate the completeness and contamination of each genome bin.  <br>
- Finally, the qa command can be used to produce different tables summarizing the quality of each genome bin. <br>


<b>Assignment:</b><br>
Sadly, the virtual machines we are using are not powerful enough for this exercise. We therefore provide the results of the first steps of the CheckM workflow (lineage_wf), up to the qa command. Scan the help pages of CheckM to find the correct command to finish the CheckM analysis.

(OPTIONAL: you can try to find out what the limiting factor of lineage_wf is using !/usr/bin/time --verbose) <br>

In [None]:
!checkm -h

In [None]:
!checkm lineage_wf -h

checkm qa help page:

In [None]:
!checkm qa -h

run checkm qa:

In [None]:
!checkm qa

(think and discuss these questions) <br>

What did you do? <br>
Where is your output? <br>
What does your output look like? <br>
What can you say about the bins with this output? <br>
What can you say about lineages of the bins?<br>

<h2>Genome annotation with Prokka</h2>

Now that we have some extra information about our bins, we can continue to analyze the high-quality bins. The final CheckM results give a good overview of the bins with low contamination and high completeness, and also show the lowest taxonomic rank assigned to each bin. Pick a bin that you think is interesting to study further.
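
If the quality summary is available as a tab-separated table, picking candidate bins can be sketched in Python as below. Everything in this example is an assumption made for illustration: the toy table and its values are made up, the column names should be checked against your own output, and the thresholds (at least 90% completeness, at most 5% contamination) are one common choice rather than a rule of this course.

```python
import csv
import io

# Hypothetical tab-separated summary; column names and values are assumptions.
toy_table = (
    "Bin Id\tCompleteness\tContamination\n"
    "bin.1\t97.5\t1.2\n"
    "bin.2\t55.0\t0.5\n"
    "bin.3\t92.0\t12.4\n"
)


def select_bins(table_text, min_completeness=90.0, max_contamination=5.0):
    """Return the IDs of bins that pass both quality thresholds."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row["Bin Id"]
        for row in reader
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]


good_bins = select_bins(toy_table)
print(good_bins)
```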

With this bin we are going to do some genome annotation. Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

<b>Assignment:</b><br>
Have a look at the prokka help pages below and come up with the right command to run prokka. (HINT: look at the Usage line to run prokka in default mode, think of threads, use --centre X --compliant to stop Prokka's complaints about ugly contig names, and direct the output towards metagenomics/prokka)

In [None]:
!prokka

make a directory for the prokka output:

In [None]:
!mkdir

prokka help page:

In [None]:
!prokka -h

run prokka:

In [None]:
!prokka --cpus --outdir 

(think and discuss these questions) <br>

What did you do? <br>
Where is your output? <br>
What does your output look like? <br>
Do you know what all the output files are or mean? <br>

(HINT: you can look at the prokka files in the same way we looked at the reads and scaffolds earlier)

<b>Investigating Prokka output</b><br>
To investigate the Prokka output you can use two web servers, both of which can place the Prokka annotations in KEGG pathways.<br>

(1) Prokka gives UniProt IDs in the gff files; first we will collect these IDs.<br>
(2) Since KEGG works with KO IDs, we then convert the UniProt IDs to KO IDs using this website:<br>
http://www.uniprot.org/uploadlists/

(3) Then you can put your IDs into both websites and investigate the pathways:<br>
http://www.genome.jp/kegg/tool/map_module.html<br>
http://pathways.embl.de/iPath2.cgi#<br>

view prokka output:

In [None]:
!head ....yourprokkaoutput.gff

take the UniProt IDs out of the gff file:

In [None]:
!grep -o 'UniProt.*' ....yourprokkaoutput | cut -d';' -f1 | cut -d':' -f2 > listofuniprotIDs
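
The same extraction can be sketched in Python, which makes it easy to skip comment lines and drop duplicate IDs before uploading the list. This is a minimal sketch under stated assumptions: the GFF lines below are hypothetical examples written in the style of Prokka output (not taken from the course data), the function name is made up, and the pattern assumes the IDs appear as `UniProtKB:<accession>` in the attributes column.

```python
import re

# Hypothetical GFF lines in the style of Prokka output, for illustration only.
toy_gff = (
    "##gff-version 3\n"
    "contig_1\tProdigal:2.6\tCDS\t1\t300\t.\t+\t0\t"
    "ID=X_00001;inference=ab initio prediction:Prodigal:2.6,"
    "similar to AA sequence:UniProtKB:P0A7G6;locus_tag=X_00001;product=example protein\n"
    "contig_1\tProdigal:2.6\tCDS\t400\t900\t.\t-\t0\t"
    "ID=X_00002;inference=ab initio prediction:Prodigal:2.6;"
    "locus_tag=X_00002;product=hypothetical protein\n"
    "contig_1\tProdigal:2.6\tCDS\t1000\t1600\t.\t+\t0\t"
    "ID=X_00003;inference=ab initio prediction:Prodigal:2.6,"
    "similar to AA sequence:UniProtKB:P0A6F5;locus_tag=X_00003;product=example protein\n"
)


def extract_uniprot_ids(gff_text):
    """Collect unique UniProtKB accessions in order of first appearance."""
    ids = []
    for line in gff_text.splitlines():
        if line.startswith("#"):
            continue  # skip GFF comment and header lines
        match = re.search(r"UniProtKB:([A-Za-z0-9]+)", line)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


uniprot_ids = extract_uniprot_ids(toy_gff)
print("\n".join(uniprot_ids))
```

Features without a UniProt match, such as the second toy line, simply contribute no ID.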