# Module 3: Variant Lineage Identification

Welcome to the module! There are some very important instructions for you to follow:

1. Click **File** in the top left corner and select **Save a copy in Drive**.

**Your changes will not be saved if you do not do this step**

2. Click the name of the workbook in the top left corner and replace "Copy of" with your full name.

## Setting up

## Installing Conda
Conda is an open-source package and environment manager for installing tools and libraries. We install it on Google Colab with the `condacolab` library; more information is available on this [website](https://inside-machinelearning.com/en/how-to-install-use-conda-on-google-colab/).

Note: your runtime will restart and reconnect after you run this cell. Colab may report that the runtime crashed. This is expected, so wait for the session to reconnect before continuing.

You can check out this repo for how this tool works:
https://github.com/conda-incubator/condacolab 


In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

## Testing your conda installation
After successfully installing condacolab, you need to make sure that you can call conda from your shell, and use it to install and run software.
The `--help` option is an easy way to test that conda is properly installed, as it displays the options available when running the tool.

In [None]:
# Check the conda installation by displaying its help page
!conda --help


## Check pre-installed conda packages
A fresh conda installation comes with a set of pre-installed packages. To list them, run the code below:

In [None]:
!conda list

## Download and navigate to directory containing genomes
Some test SARS-CoV-2 genome assemblies obtained from GISAID are saved in the file `assemblies_gisaid.zip`. Download and unzip the file, which will create a folder called `assemblies_gisaid`. Then use the `%cd` magic command to switch to that directory (a plain `!cd` would only change directory inside a temporary subshell):

In [None]:
!wget https://wcs_data_transfer.cog.sanger.ac.uk/assemblies_gisaid.zip
!unzip assemblies_gisaid.zip

In [None]:
%cd assemblies_gisaid
!pwd

## Merging multiple files
In some cases, as here, you will have multiple individual files, often FASTA files, that you want to analyze together. To merge these files into a **single** multi-FASTA file, we will use the concatenate command `cat` and redirect the output to a file called `combined_genomes.fasta`:

In [None]:
# Let's concatenate all FASTA files using a wildcard
!cat *.fasta > combined_genomes.fasta
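
As a quick sanity check on the merge, the number of sequences in the combined file can be counted from Python. This is a minimal sketch: the helper name `count_fasta_records` is our own, and it assumes every record starts with a `>` header line. If each input file holds one genome, the count should equal the number of input files.

```python
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Only runs the check if the merged file from the cell above is present
combined = Path("combined_genomes.fasta")
if combined.exists():
    print(count_fasta_records(combined), "sequences in", combined)
```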

## Practice questions
Please answer the following questions in the code blocks below

1.   How do you list the contents of this directory?
2.   How do you view the contents of one of the FASTA files? 

Enter your answers below and launch the command.

In [None]:
# 1. List the contents of this directory 
# enter your code below


In [None]:
# 2. View the contents of one of the FASTA files 
# enter your code below


# Pangolin

PANGOLIN is an acronym for Phylogenetic Assignment of Named Global Outbreak Lineages. It is a software tool that assigns your SARS-CoV-2 genome to the most closely related SARS-CoV-2 lineage in the global context, based on the mutations in the query sequence. Pangolin is also available as a web application where you can upload your FASTA files: https://pangolin.cog-uk.io/.

### Installing Pangolin for variant detection
Pangolin can be installed within conda by running the code below:

In [None]:
!conda install -c bioconda -c conda-forge -c defaults pangolin=4.1.1

In [None]:
# Pin the tabulate dependency to version 0.8.10 to prevent errors associated with Pangolin
!pip install tabulate==0.8.10

To check that Pangolin installed correctly, run the following command:

In [None]:
!pangolin --help

### Launching Pangolin on the multi-FASTA file
You have now installed Pangolin and displayed its help page, which lists the commands and options you can adjust when running an analysis.

We will now assign lineages to the genomes in the merged FASTA file using Pangolin with the default settings, and direct the output to the `results` folder. By default, the output file will be called `lineage_report.csv`.

In [None]:
!pangolin --outdir results combined_genomes.fasta

**NOTE**: Download the `lineage_report.csv` file from the `results` folder to your computer and open it with spreadsheet software such as MS Excel.
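
The report can also be summarised directly in the notebook. The sketch below is illustrative only: the helper name `count_lineages` is our own, and it assumes the report sits at `results/lineage_report.csv` and contains a `lineage` column, as Pangolin 4 reports do.

```python
import csv
from collections import Counter
from pathlib import Path


def count_lineages(report_path):
    """Return a Counter mapping each lineage to the number of genomes assigned to it."""
    with open(report_path, newline="") as handle:
        reader = csv.DictReader(handle)
        return Counter(row["lineage"] for row in reader)


# Assumed location, based on the --outdir used in the command above
report = Path("results/lineage_report.csv")
if report.exists():
    for lineage, n_genomes in count_lineages(report).most_common():
        print(lineage, n_genomes)
```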

# Nextclade

Nextclade is another excellent tool that, like Pangolin, can identify the variants in your genomes. Its report also includes additional details, such as the number of mutations across the genome and where they are located. Nextclade is also available as a web application: https://clades.nextstrain.org/.

### Installing Nextclade

Install Nextclade with conda using the following command, and then test the installation by displaying the help page.

In [None]:
!conda install -c bioconda nextclade

In [None]:
# Check that Nextclade installed properly
!nextclade --help
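
If one of the help commands fails, it can be useful to confirm from Python which tools are actually visible on the `PATH`. The following is a minimal sketch using the standard library; the helper name `tool_status` is our own.

```python
import shutil


def tool_status(names):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


# Report the tools installed so far in this notebook
for name, found in tool_status(["conda", "pangolin", "nextclade"]).items():
    print(name, "found" if found else "NOT found on PATH")
```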

### Running Nextclade

Before running Nextclade on the assembled genomes, first download the latest available Nextclade dataset for SARS-CoV-2.

In [None]:
# Download the SARS-CoV-2 dataset folder for Nextclade
!nextclade dataset get --name sars-cov-2 --output-dir sars-cov-2-dataset

Then identify the clades from the genomes in the multi-FASTA file.

In [None]:
# Run Nextclade on your concatenated FASTA file to generate a TSV output file
!nextclade run --output-tsv nextclade_report.tsv --input-dataset sars-cov-2-dataset combined_genomes.fasta

**NOTE**: Download the `nextclade_report.tsv` file to your computer and open it with spreadsheet software such as MS Excel.
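
Before opening the report in a spreadsheet, it can help to see which columns it contains. The sketch below is a generic tab-separated preview, not a Nextclade-specific API: the helper name `preview_tsv` is our own, and it assumes the report was written to `nextclade_report.tsv` in the current directory.

```python
import csv
from pathlib import Path


def preview_tsv(tsv_path, n_rows=3):
    """Return (column_names, first n_rows rows as dicts) from a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = [row for _, row in zip(range(n_rows), reader)]
        return reader.fieldnames, rows


# Assumed location, based on the --output-tsv used in the command above
report = Path("nextclade_report.tsv")
if report.exists():
    columns, _ = preview_tsv(report)
    print(len(columns), "columns:")
    for column in columns:
        print(" ", column)
```

Once you know the column names, you can pick out the ones relevant to the assignment questions below.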

## Assignment
Complete this assignment and answer the questions on the platform.

1. Run Pangolin on the FASTA file with a maximum ambiguity of 50% (0.5). Set the output filename to your first name + "result.csv", and the output directory to your surname (e.g. "steve_result.csv" in a directory named "rogers"). Do all of this using just one command. **Hint:** Use the `--help` option to find the appropriate options.

2. From your Nextclade results, what clades are present in the genomes? How many of each are present?

3. Also from your Nextclade results, how many amino acid substitutions does the genome from Italy have?

4. Can you tell how many missing nucleotides are in the genome from Mali from the Nextclade report?

In [None]:
# 1. Run Pangolin with a maximum ambiguity of 0.5, naming the output file with your first name + "result.csv" and the output directory with your surname, all using only one command.
# Enter your code below



In [None]:
# 2. What clades are present among the genomes? How many of each clade are present?
# Enter your answer below with python print command



In [None]:
# 3. How many amino acid substitutions does the genome from Italy have?
# Enter your answer below with python print command



In [None]:
# 4. How many missing nucleotides are in the genome from Mali?
# Enter your answer below with bash echo command

