# Module 4: Variant Lineage Identification

Welcome to the module! There are some very important instructions for you to follow:

1.) Click **File** in the top-left corner and select **Save a copy in Drive**

**Your changes will not be saved if you do not do this step**

2.) Click on the name of the workbook in the top left corner and replace "Copy of" with your full name

## Setting up

## Installing Conda
Conda is an open-source tool for installing and managing software packages and their dependencies. More information on condacolab, the library used to install conda on Google Colab, is available on this [website](https://inside-machinelearning.com/en/how-to-install-use-conda-on-google-colab/).

Note: your runtime will restart and reconnect after running this cell. Colab may report that the runtime crashed; this is expected, so wait for the session to reconnect before continuing.

You can check out this repo for how this tool works:
https://github.com/conda-incubator/condacolab 


In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

## Testing your conda installation
After successfully installing condacolab, confirm that you can call conda from the shell, since you will use it to install and run other software.
The `--help` option is an easy way to test that conda is properly installed: it displays the help options for running the tool.

In [None]:
# Check the conda installation by displaying the help page
!conda --help
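As a complementary check, the short Python sketch below reports whether a command-line tool can be found on the `PATH`. The helper name `tool_available` is hypothetical and introduced only for illustration. Pangolin and Nextclade are expected to show as "not found" until they are installed later in this module.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


# Tools used in this module; the last two are installed in later cells.
for tool in ("conda", "pangolin", "nextclade"):
    status = "found" if tool_available(tool) else "not found"
    print(f"{tool}: {status}")
```

You can re-run this cell after each installation step to confirm that the new tool is visible to the shell.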


## Check pre-installed conda packages
A fresh conda installation comes with a set of pre-installed packages. To list them, run the code below:

In [None]:
!conda list

## Download and navigate to directory containing genomes
The test SARS-CoV-2 genome assemblies used in this module were obtained from GISAID and are packaged in a tar archive hosted on Zenodo. Download and extract the archive, then use the `%cd` command to switch to the extracted directory:

In [None]:
!wget https://zenodo.org/records/10888461/files/module_3.tar

In [None]:
# Extract the archive downloaded in the previous cell
!tar -xvf module_3.tar

In [None]:
%cd module_3_P1/
!pwd

The data analyzed in this module are derived from the article "[Overview of the SARS-CoV-2 genotypes circulating in Latin America during 2021](https://doi.org/10.3389/fpubh.2023.1095202)" published by Molina-Mora et al. in 2023, as part of a CABANA project.
Below, you will find the Accession ID from the GISAID platform and the country from which the sample originated:

#### Genomes 
| Country     | Accession ID   |
|-------------|----------------|
| Argentina   | EPI_ISL_14434222 |
| Argentina   | EPI_ISL_14434402 |
| Argentina   | EPI_ISL_14434358 |
| Bolivia     | EPI_ISL_8800564  |
| Bolivia     | EPI_ISL_8800607  |
| Bolivia     | EPI_ISL_8800591  |
| Costa Rica  | EPI_ISL_7711628  |
| Costa Rica  | EPI_ISL_7711763  |
| Costa Rica  | EPI_ISL_7711812  |
| Colombia    | EPI_ISL_10072006 |
| Colombia    | EPI_ISL_10072040 |
| Colombia    | EPI_ISL_10080397 |
| Mexico      | EPI_ISL_7812926  |
| Mexico      | EPI_ISL_7812869  |
| Mexico      | EPI_ISL_7813015  |
| Peru        | EPI_ISL_7961355  |
| Peru        | EPI_ISL_7961418  |
| Peru        | EPI_ISL_7961482  |
| Brazil      | EPI_ISL_3369834  |
| Brazil      | EPI_ISL_3369992  |
| Brazil      | EPI_ISL_3373439  |
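One way to confirm that the multi-FASTA file contains the genomes listed above is to read its header lines. The sketch below assumes the file name used in the Pangolin command later in this module (`gisaid_hcov-19_2024_03_27_02.fasta`) and that the file is in the current directory; the helper name `fasta_headers` is hypothetical.

```python
from pathlib import Path


def fasta_headers(path):
    """Return the header lines (without the leading '>') from a FASTA file."""
    headers = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


# Assumed file name, taken from the Pangolin command in this module.
fasta_path = Path("gisaid_hcov-19_2024_03_27_02.fasta")
if fasta_path.exists():
    headers = fasta_headers(fasta_path)
    print(f"{len(headers)} sequences found")
    for header in headers:
        print(header)
else:
    print(f"{fasta_path} not found; check the current directory")
```

The number of sequences reported should match the number of rows in the table.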

# Pangolin

PANGOLIN stands for Phylogenetic Assignment of Named Global Outbreak Lineages. It is a software tool that assigns your SARS-CoV-2 genome to the most closely related SARS-CoV-2 lineage in the global context, based on the mutations in the query sequence. Pangolin is also available as a web application that lets you upload your FASTA files: https://pangolin.cog-uk.io/.

### Installing Pangolin for variant detection
Pangolin can be installed within conda by running the code below:

In [None]:
!conda install -c bioconda -c conda-forge -c defaults pangolin

To check that Pangolin installed correctly, run the following command:

In [None]:
!pangolin -h

### Launching Pangolin on the multi-FASTA file
You have now installed Pangolin and displayed its help page, which lists the commands and options you can adjust when running an analysis.

Next, we will identify the lineages of the genomes in the multi-FASTA file using Pangolin with the default settings, and write the output to the `results` folder. By default, the output file is called `lineage_report.csv`.

In [None]:
!pangolin --outdir results gisaid_hcov-19_2024_03_27_02.fasta

**NOTE**: Download the `lineage_report.csv` file to your computer and open it with spreadsheet software such as MS Excel.
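If you prefer to stay in the notebook, the report can also be summarized with a few lines of Python. This is a minimal sketch that assumes the report is at `results/lineage_report.csv` and contains a `lineage` column, as in current Pangolin releases; the helper name `count_lineages` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_lineages(report_path):
    """Count how many sequences were assigned to each lineage."""
    with open(report_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Assumes the report has a column named "lineage".
        return Counter(row["lineage"] for row in reader)


report = Path("results/lineage_report.csv")
if report.exists():
    for lineage, n in count_lineages(report).most_common():
        print(f"{lineage}\t{n}")
else:
    print(f"{report} not found; run Pangolin first")
```

The output lists each lineage with the number of genomes assigned to it, most frequent first.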

# Nextclade

Nextclade is another tool that, like Pangolin, can identify variants in your genomes. Its report includes additional information, such as the number of mutations across the genome and where they are located. Nextclade is also available as a web application: https://clades.nextstrain.org/.

### Installing Nextclade

Install Nextclade with conda using the following command, and then test the installation by displaying the help page.

In [None]:
!conda install -c bioconda nextclade

In [None]:
# Check that Nextclade installed properly
!nextclade --help

### Running Nextclade

Before running Nextclade on the assembled genomes, first download the latest available Nextclade dataset for SARS-CoV-2.

In [None]:
# Download the SARS-CoV-2 dataset folder for Nextclade
!nextclade dataset get --name sars-cov-2 --output-dir sars-cov-2-dataset

Then identify the clades from the genomes in the multi-FASTA file.

In [None]:
# Run Nextclade on the multi-FASTA file to generate a TSV report
!nextclade run --output-tsv nextclade_report.tsv --input-dataset sars-cov-2-dataset gisaid_hcov-19_2024_03_27_02.fasta

**NOTE**: Download the `nextclade_report.tsv` file to your computer and open it with spreadsheet software such as MS Excel.
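The TSV report can also be inspected directly in the notebook. The sketch below keeps only a few columns per sequence; it assumes the column names `seqName`, `clade`, `totalAminoacidSubstitutions`, and `totalMissing`, which may differ between Nextclade versions, and the helper name `read_nextclade_rows` is hypothetical.

```python
import csv
from pathlib import Path

# Assumed Nextclade TSV column names; check the header of your own report.
COLUMNS = ("seqName", "clade", "totalAminoacidSubstitutions", "totalMissing")


def read_nextclade_rows(tsv_path, columns=COLUMNS):
    """Return one dict per sequence, keeping only the requested columns."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Columns absent from the report are filled with an empty string.
        return [{col: row.get(col, "") for col in columns} for row in reader]


report = Path("nextclade_report.tsv")
if report.exists():
    print("\t".join(COLUMNS))
    for row in read_nextclade_rows(report):
        print("\t".join(row[col] for col in COLUMNS))
else:
    print(f"{report} not found; run Nextclade first")
```

Printing a reduced table like this makes it easier to compare sequences side by side than scrolling through the full report.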

## Assignment

1. From your Nextclade results, what is the most common clade among the samples?

2. Also from your Nextclade results, how many amino acid substitutions does the genome EPI_ISL_7711812 have?

3. Which genome has the highest number of "missing" entries?