# Lecture 8 - Assembly, quality assessment, and predictions II

The purpose of this tutorial is to continue exploring genome assemblies and the effect of read depth and error rates on the contiguity of assemblies. The first section of this tutorial will use BUSCO ([Simão et al., 2015](https://academic.oup.com/bioinformatics/article/31/19/3210/211866)) to evaluate the quality of the genome assemblies we have done so far: the bacterium *Borrelia* and the *Takifugu* fish species.

---

## 1. Setting up the environment

As always, we need to create an environment with the packages we will use throughout the tutorial.
BUSCO [has a list of requirements](https://busco.ezlab.org/busco_userguide.html#manual-installation) that have [requirements themselves](https://github.com/Gaius-Augustus/Augustus). It can be difficult to install successfully and it requires a little bit of patience.

Try the first formula. If the environment is not solving and the process is killed, then try the second formula.
If you already have installed BUSCO from the previous tutorial, go ahead and use that installation. You can check if your installation is working correctly by calling the help menu `busco -h`
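
Before trying either formula, it can help to see which command-line tools are already on your `PATH`. The sketch below relies only on the shell built-in `command -v`; the helper name `check_tool` is made up for this illustration, and the tool list is simply the three programs tested at the end of the first formula.

```bash
# check_tool is a hypothetical helper: it reports whether a program is on the PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
    fi
}

# the three programs this tutorial tests after installation
for TOOL in busco makeblastdb samtools; do
    check_tool "$TOOL"
done
```

A tool reported as found can still fail at run time (for example because of the library errors mentioned below), so calling `busco -h` remains the real test.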

### 1.1. First formula

In [None]:
%%bash

# FIRST RECIPE
# GO to your working environment and create your folder for this tutorial
mkdir lecture8
cd lecture8

# just in case you have a conda environment active; e.g. that the prompt has a name within brackets other than "(base)"
conda deactivate

# create the environment with the first set of packages.
# this recipe includes Augustus, although we won't use it per se
# I recommend you create this environment in this order. In particular, installing Augustus and hmmer before BUSCO
conda update -n base conda
conda create -n busco augustus=3.2.3 samtools bcftools htslib boost zlib bamtools

# activate the busco environment once the packages above are installed
conda activate busco

# continue installing the following packages
conda install -c bioconda blast
conda install -c bioconda hmmer
conda install -c agbiome bbtools
conda install -c bioconda metaeuk sepp
conda install -c anaconda pandas

# finish setting up the environment by installing busco
conda install -c bioconda busco

# test the installation
busco -h
makeblastdb -h
samtools view

# if the three above work, you are very lucky!

# if you have an error regarding libcrypto and libssl:
# you MUST run the following lines for things to run
cd ~/anaconda3/envs/busco/lib
ln -s libcrypto.so.1.1 libcrypto.so.1.0.0
ln -s libssl.so.3 libssl.so.1.0.0

# go back to your working folder
cd path/to/folder/lecture8

### 1.2. Second formula - go for this if something isn't working or if it is taking you too long to install the one above

The installation above has a lot of moving gears and it is hard to get everything to work. Moreover, all of us have different versions of Conda, Ubuntu, and operating systems, which adds even more gears to synchronise. When this happens, the best option for research reproducibility and for running software smoothly is to use [containers](https://en.wikipedia.org/wiki/Containerization_(computing)). Containers are completely closed systems. Unlike environments, they include all the packages and instructions needed to run the software regardless of the operating system's configuration. Environments like conda help you install packages, but they can leak (when a package uses a path to a library outside the environment) and are not reproducible when operating systems have different configurations. Basically, environments can work in some settings and not in others (as we are very well aware by now).

This formula installs a containerisation program called [Singularity]() using conda and an environment for the singularity software. Then, we activate that environment and use Singularity to run BUSCO.

For the installation:

In [None]:
%%bash

conda deactivate

cd /path/to/lecture8

conda create -n singularity -c conda-forge singularity
conda activate singularity

# this will use singularity to get the container that contains BUSCO v.5.2.2
singularity pull docker://ezlabgva/busco:v5.2.2_cv1

### 1.3. Get the data ready

Once you \*successfully\* install BUSCO, we can proceed to get the genome assemblies we will use for quality assessment and the single-copy ortholog gene databases from BUSCO.

Download the following files from **emokymai**, Lecture * folder:

- bacteria.tar.gz
- takifugu.tar.gz

Place those files within the lecture8 folder in your computer (your current working folder). Then, proceed by using `tar` to untar (\*tar) and decompress (\*gz) the files in your working folder.


In [None]:
%%bash

# place the files in your working directory "lecture8"
cd lecture8

# decompress and un-tar the *tar.gz files
tar xvzf bacteria.tar.gz
tar xvzf takifugu.tar.gz

If for some reason you cannot run BUSCO and BUSCO cannot download its own databases (it does that by default), you can download them from their repository using `wget`

**No need to run it**, it is just for you to know :)

In [None]:
%%bash

wget https://busco-data.ezlab.org/v5/data/lineages/spirochaetales_odb10.2020-03-06.tar.gz
wget https://busco-data.ezlab.org/v5/data/lineages/actinopterygii_odb10.2020-03-06.tar.gz

# un-tar and decompress the files so that BUSCO can use them
tar xvzf spirochaetales_odb10.2020-03-06.tar.gz
tar xvzf actinopterygii_odb10.2020-03-06.tar.gz

---

## 2. Running BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a program for assessing the quality of genome assemblies. It evaluates the completeness and accuracy of genome assemblies by predicting and annotating single-copy orthologous genes that are expected to be present in all members of a specific taxonomic group.

BUSCO runs in a single command that invokes the following steps:

1. Loads a reference database of single-copy orthologous genes that are conserved across the taxon, which the user specifies.
2. It predicts genes in the input assembly using [prodigal](https://github.com/hyattpd/Prodigal) for prokaryotes and [Augustus](https://github.com/Gaius-Augustus/Augustus) or [metaeuk](https://github.com/soedinglab/metaeuk) for eukaryotes. This step narrows down the regions of the assembly to analyse further.
3. Once genes are predicted, BUSCO uses [HMMER](https://github.com/EddyRivasLab/hmmer) to evaluate orthology and duplication of single-copy genes among the predicted genes. At this step, BUSCO takes the reference database of single-copy genes (note that the database consists of HMMER profiles!) and feeds it to HMMER, which then searches for similar genes in the set of regions predicted before.
4. BUSCO classifies predicted genes based on the number of copies and their completeness as annotated by HMMER. If an orthologue is annotated once in the set of predicted genes, then that gene is classified as "single complete". If it is annotated more than once in the set of predicted genes, then the gene is classified as "duplicated". If only a partial gene is annotated, then it is classified as "fragmented". If it is not annotated, then it is presumed "missing".
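
The classification rule in the last step can be sketched as a small shell function. This is only a toy version for building intuition: `classify_busco` is a hypothetical helper, not part of BUSCO, and it takes the number of complete and partial matches as given, whereas BUSCO itself has to derive them from the HMMER results.

```bash
# classify_busco is a hypothetical helper, not part of BUSCO itself
# $1 = number of complete matches found for one ortholog
# $2 = number of partial matches found for the same ortholog
classify_busco() {
    local complete="$1"
    local partial="$2"
    if [ "$complete" -eq 1 ]; then
        echo "Complete and single-copy (S)"
    elif [ "$complete" -gt 1 ]; then
        echo "Complete and duplicated (D)"
    elif [ "$partial" -gt 0 ]; then
        echo "Fragmented (F)"
    else
        echo "Missing (M)"
    fi
}

classify_busco 1 0   # one complete copy
classify_busco 3 0   # several complete copies
classify_busco 0 1   # only a partial match
classify_busco 0 0   # nothing found
```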

### 2.1. The Borrelia dataset

We will run BUSCO on the genome assemblies for the *Borrelia* bacterium that we assembled in the last practical. For those assemblies, we simulated Illumina reads with 1x and 3x error rates, and sequencing efforts of 5x, 10x, and 20x read depth. Take the singularity code and make it into a **script** for running BUSCO on a single sample so that you can use it in the future with other samples and lineages. Then, create a **loop** to run that script on all the bacterial assemblies.

In the following command, you are first calling `singularity` to start the container or "virtual machine". Then you are telling singularity to `run` a command within the container. In this case, your command is `busco -l lineage -m genome -i assembly.fasta --out prefix_busco -f`. The argument `-B` specifies the "source" and "destination" paths (or outside and inside paths) that singularity uses to mount the container. Both paths must be specified with the format: src\[:dest\[:opts\]\], where opts is "mount options". The mount options can be 'ro' (read-only) or 'rw' (read/write, the default).

Because a container is meant to be a closed system, singularity allows you to access your local paths using [bind mounts](https://docs.sylabs.io/guides/3.5/user-guide/bind_paths_and_mounts.html). Basically, bind mounts allow you to read and write data on the host system easily. There are two ways of specifying bind mounts: system-defined bind paths and user-defined bind paths. **System-defined bind paths** are derived from your system's variables, e.g. `$HOME` and `$PWD`. **User-defined bind paths** are passed to `singularity run` via the `-B` argument; they are paths chosen by the user, for example because the input data is there or because it is the user's working folder.
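
As an illustration of the `src[:dest[:opts]]` format, the sketch below builds a read-only bind specification and prints the resulting command instead of running it. The host folder `bacteria` and the mount point `/data` are hypothetical; replace them with paths that exist on your system.

```bash
# hypothetical paths, replace them with your own
HOST_DATA="$PWD/bacteria"     # folder on your computer (source)
CONTAINER_DATA="/data"        # where it appears inside the container (destination)

# src:dest:opts, with 'ro' making the mount read-only
BIND_SPEC="${HOST_DATA}:${CONTAINER_DATA}:ro"

# print the command instead of running it, so you can inspect it first
echo singularity run -B "$BIND_SPEC" busco_v5.2.2_cv1.sif busco -h
```

A read-only mount suits input data; BUSCO still needs a writable folder for its output, which is why the tutorial binds `$PWD` with the default read/write option.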

Finally, `*.sif` files are container files; in our case, `busco_v5.2.2_cv1.sif` is the file containing BUSCO's container.

Regarding BUSCO's command `busco -l lineage -m genome -i assembly.fasta --out prefix_busco -f`, lineage (`-l`) should be replaced by the name of a lineage database depending on the organism that you are assembling. You can check the list [here](https://busco-data.ezlab.org/v5/data/lineages/). It is advisable to choose the lowest taxonomic level available that is closest to your organism. For example, we know that *Borrelia* belongs to Spirochaetales, thus, we can use the closest database to Spirochaetales instead of simply using the "bacteria" database. It is also advisable to compare results obtained from using different databases. The analysis mode (`-m`) can be genome or transcriptome (amongst others). `-f` forces BUSCO to delete previous runs with the same `--out` prefix name before continuing.

In [None]:
%%bash

# This is how you run BUSCO by launching the singularity "sandbox"
singularity run -B $PWD busco_v5.2.2_cv1.sif busco -l lineage -m genome -i assembly.fasta --out prefix_busco -f


<div class="alert alert-block alert-success">
<b>TO DO: </b>Create a script file to run BUSCO's container on individual samples. Use Bash's positional variables to pass either a file name prefix or a file name and the information you need to run the command. Then, create a loop to run your script in all bacterial genome assemblies. Copy-paste your code in your <i>answersL8_name.txt</i> file.</div>
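
If positional variables are new to you, the sketch below shows how they behave, together with two parameter expansions that are handy for building output names. It is a generic demonstration, not the solution: `describe_input` is a hypothetical function, and the path passed to it is only an example that may not match your folder structure.

```bash
# describe_input is a hypothetical helper used only for this demonstration
# $1 = path to a file, $2 = any second argument (here a lineage name)
describe_input() {
    local path="$1"
    local lineage="$2"
    local file="${path##*/}"     # strip everything up to the last "/"
    local prefix="${file%.*}"    # strip the last extension
    echo "file=${file} prefix=${prefix} lineage=${lineage}"
}

# example call with a made-up path
describe_input bacteria/sim5error1_out/scaffolds.fasta spirochaetales_odb10
```

Inside a script file, `$1` and `$2` work the same way, taking their values from the arguments you type after the script name.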

Solution hidden here

<!--

# create a script file with nano:

nano singularity_busco.sh

# create the script using positional variables in bash
# https://www.computerhope.com/jargon/p/positional-parameter.htm
# pay particular attention to the paths and make sure they reflect your folder structure!

#-------------- BEGINNING OF THE FILE
#!/bin/bash

# $1 is the path to the *scaffold.fasta file, including the file name, relative to the current directory
# $2 is the lineage
# run as ./singularity_busco.sh path/to/assembly.fasta spirochaetales_odb10

NAME=${1##*/}

echo "Running from folder $PWD"
echo "Processing sample ${NAME}"

singularity run -B "$PWD" busco_v5.2.2_cv1.sif busco -l "$2" -m genome -i "$PWD/$1" --out "${NAME%.*}_busco" -f

echo "Done with ${NAME}"
#-------------- END OF THE FILE

# make the script executable (only needed once)
chmod +x singularity_busco.sh

# create a loop to run it

for SAMPLE in bacteria/*_out; do
    ./singularity_busco.sh ${SAMPLE}/*fasta spirochaetales_odb10
done
-->

### 2.2. The Takifugu fish dataset

Use the same code and scripts you created above to run BUSCO on two or more fish assemblies.

---

## 3. Checking out the results

We have generated a lot of folders with the output from running BUSCO on the bacteria and fish assemblies. How can we quickly and efficiently check which samples did better than others?

BUSCO results are summarised in a file with prefix `short_summary.specific.*` located within BUSCO's output folder, which should have the prefix name you passed to BUSCO using the `--out` argument. In the example above, the command passed told BUSCO to generate an output folder called "prefix_busco". The file contains a summary string that shows the percentage of complete (single and duplicated), fragmented, and missing ortholog genes.

The summary looks like this:

`C:34.8%[S:34.8%,D:0.0%],F:9.6%,M:55.6%,n:345`

The summary is followed by the count of genes:

- 120     Complete BUSCOs (C)  
- 120     Complete and single-copy BUSCOs (S)  
- 0       Complete and duplicated BUSCOs (D)  
- 33      Fragmented BUSCOs (F)  
- 192     Missing BUSCOs (M) 
- 345     Total BUSCO groups searched
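
Because the summary string has a fixed layout, individual values can be pulled out of it with standard text tools. The sketch below uses `sed` on the example string shown above; the variable names are arbitrary, and the patterns assume the layout shown here.

```bash
# summary string copied from the example above
SUMMARY='C:34.8%[S:34.8%,D:0.0%],F:9.6%,M:55.6%,n:345'

# capture the number that follows each label
COMPLETE=$(echo "$SUMMARY" | sed -E 's/^C:([0-9.]+)%.*/\1/')
FRAGMENTED=$(echo "$SUMMARY" | sed -E 's/.*F:([0-9.]+)%.*/\1/')
MISSING=$(echo "$SUMMARY" | sed -E 's/.*M:([0-9.]+)%.*/\1/')
TOTAL=$(echo "$SUMMARY" | sed -E 's/.*n:([0-9]+).*/\1/')

echo "complete=${COMPLETE} fragmented=${FRAGMENTED} missing=${MISSING} total=${TOTAL}"
```

Having the values in separate variables makes it easier to sort or tabulate samples later on.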


<div class="alert alert-block alert-success">
<b>TO DO: </b>Look at your folder structure. Write a script to print into standard output (STDOUT, your screen) the name of the sample or assembly and the BUSCO short summary string. Copy-paste your code to your <i>answersL8_name.txt</i> file and append the result of your code with <i>>></i>.</div>

Solution hidden here

<!--

# my BUSCO output folders for the bacteria data are called like this:
sim5error1_busco
sim5error3_busco
sim10error1_busco
sim10error3_busco

# etc..

my code would be:

for sample in *_busco; do echo "$sample"; grep '%' "$sample"/short*txt; done
-->

<div class="alert alert-block alert-success">
<b>TO DO: </b>Which read-depth and error rate resulted in the most complete genome assembly for <i>Borrelia</i>? Which read-depth and error rate resulted in the worst genome assembly? Add your answer to your <i>answersL8_name.txt</i>.</div>

---

## 4. Reference-based assembly and BUSCO scores

We have evaluated the BUSCO scores of our bacteria (and fish) assemblies and we understand the effect of read depth and error rate on the quality of the assembly. Now, we will generate two reference-based genome assemblies of *Borrelia* using `SPAdes` and two sets of Illumina reads we simulated during the last practical. One set will have low read depth (5x) and the other high read depth (20x); both sets will have error rate 3x. After we assemble the genomes again, we will run BUSCO on both assemblies and compare how much the assembly improves when there is a reference genome available to guide the assembly process.

Use the same environment from the last practical (`reseq`), the Illumina reads we simulated (`sim5error3*` and `sim20error3*`), and the longest contig you retrieved from the reference genome (``).

In [None]:
%%bash

# deactivate the current environment
conda deactivate

# activate the last environment
conda activate reseq

# run SPAdes using the longest contig as reference and 