# Pangenomics workflow

## Background

This notebook contains the steps to create a pangenome for the 11 Escherichia-Shigella type strains listed in Table 1.  
This workflow uses [Anvi'o](https://anvio.org/) to calculate and visualise the pangenome.  

**Table 1.** Type strains for pangenomics analysis  

| Assembly Accession | Organism Name | Total Sequence Length | Assembly Level | Assembly Submission Date |
| ------------------ | ------------- | --------------------- | -------------- |  ----------------------- |
| GCF_000005845.2 | Escherichia coli str. K-12 substr. MG1655 | 4641652 | Complete Genome | 2013-09-26 |
| GCF_000006925.2 | Shigella flexneri 2a str. 301 | 4828820 | Complete Genome | 2011-08-03 |
| GCF_000008865.2 | Escherichia coli O157:H7 str. Sakai | 5594605 | Complete Genome | 2018-06-08 |
| GCF_002290485.1 | Shigella boydii | 4825405 | Contig | 2017-09-12 |
| GCF_013374815.1 | Shigella sonnei | 4762774 | Complete Genome | 2020-06-27 |
| GCF_016904755.1 | Escherichia albertii | 4631903 | Complete Genome | 2021-03-10 |
| GCF_020097475.1 | Escherichia fergusonii | 4645928 | Complete Genome | 2021-09-23 |
| GCF_020283705.1 | Escherichia whittamii | 4563680 | Scaffold | 2021-10-05 |
| GCF_022354085.1 | Shigella dysenteriae | 5192674 | Complete Genome | 2022-02-22 |
| GCF_024733345.1 | Escherichia ruysiae | 4893780 | Contig | 2022-08-22 |
| GCF_902709585.1 | Escherichia marmotae | 4584809 | Contig | 2020-07-20 |


## Downloading data

We will use the NCBI Datasets command-line tool to download each genome using the assembly accessions from Table 1. We only need the annotated GenBank files (`gbff`), so we exclude all other formats available for each assembly.

In [None]:
%%bash
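# Accessions are passed as separate, space-separated arguments.
# The --include-*/--exclude-* flags belong to older datasets releases;
# newer releases use a single `--include gbff` option instead.
datasets download genome accession \
        GCF_000005845.2 GCF_000006925.2 GCF_000008865.2 \
        GCF_002290485.1 GCF_013374815.1 GCF_016904755.1 \
        GCF_020097475.1 GCF_020283705.1 GCF_022354085.1 \
        GCF_024733345.1 GCF_902709585.1 \
        --include-gbff \
        --exclude-protein \
        --exclude-rna \
        --exclude-genomic-cds \
        --exclude-seq \
        --exclude-gff3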

The assemblies are downloaded as a single compressed archive (`ncbi_dataset.zip`), which needs to be unpacked.

In [None]:
%%bash

unzip ncbi_dataset.zip -d escherichia_shigella
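
Before moving on, it can be worth confirming that every accession from Table 1 actually has an annotated GenBank file in the unpacked archive. The following is a minimal sketch, assuming the `ncbi_dataset/data/<accession>/genomic.gbff` layout used in the rest of this notebook; the function name `check_gbff` is hypothetical and not part of any tool used here.

```shell
# Hypothetical helper: report accessions whose genomic.gbff is missing or empty.
# Usage: check_gbff DATA_DIR ACCESSION...
# Example: check_gbff escherichia_shigella/ncbi_dataset/data GCF_000005845.2 GCF_000006925.2
check_gbff() {
    local data_dir=$1
    shift
    local missing=0
    local acc
    for acc in "$@"; do
        # -s is true only for files that exist and are not empty
        if [ ! -s "${data_dir}/${acc}/genomic.gbff" ]; then
            echo "missing: ${acc}"
            missing=$((missing + 1))
        fi
    done
    echo "${missing} of $# accessions missing"
    # Return success only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

The function returns a non-zero status when any file is missing, so it can also be used as a guard before the formatting step.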

## File formatting

Each annotated assembly file is stored in a separate folder named after its accession. We need to keep this in mind when giving the path to each assembly file.  
The next step is to split each GenBank file into three separate files (contigs FASTA, external gene calls, and functional annotations) that can be used by Anvi'o.

In [None]:
%%bash
# Loop over the GenBank file of every downloaded assembly
for genome in escherichia_shigella/ncbi_dataset/data/*/genomic.gbff
do
    # Extract the accession (the folder name) from the path
    name=${genome#escherichia_shigella/ncbi_dataset/data/}
    name=${name%/genomic.gbff}
    anvi-script-process-genbank \
        -i "$genome" \
        --output-fasta "escherichia_shigella/ncbi_dataset/data/${name}/${name}-contigs.fasta" \
        --output-gene-calls "escherichia_shigella/ncbi_dataset/data/${name}/${name}-gene-calls.txt" \
        --output-functions "escherichia_shigella/ncbi_dataset/data/${name}/${name}-functions.txt" \
        --annotation-source prodigal \
        --annotation-version v2.6.3
done
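
The loop above relies on Bash parameter expansion to turn a file path into an accession. As a small illustration, the sketch below applies the same two expansions to a single example path; the path is taken from the folder layout used in this notebook.

```shell
# Example path in the layout produced by unpacking the NCBI Datasets archive
genome="escherichia_shigella/ncbi_dataset/data/GCF_000005845.2/genomic.gbff"

# '#' removes the shortest matching prefix
name=${genome#escherichia_shigella/ncbi_dataset/data/}
echo "$name"    # GCF_000005845.2/genomic.gbff

# '%' removes the shortest matching suffix
name=${name%/genomic.gbff}
echo "$name"    # GCF_000005845.2
```

If the prefix or suffix pattern does not match, the value is left unchanged, so a wrong pattern fails silently rather than with an error.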

In [None]:
%%bash 

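# Write the header of the table that lists the input files of each genome
echo -e "name\tpath\texternal_gene_calls\tgene_functional_annotation" > fasta.txt
for strain in escherichia_shigella/ncbi_dataset/data/*/*-contigs.fasta
do
    # Strip the directory part and the suffix to get the accession
    strain_name=${strain#escherichia_shigella/ncbi_dataset/data/*/}
    strain_name=${strain_name%-contigs.fasta}
    # Anvi'o is strict about names (letters, digits, underscores), so replace the dot
    strain_name=${strain_name//./_}
    prefix=${strain%-contigs.fasta}
    echo -e "${strain_name}\t${strain}\t${prefix}-gene-calls.txt\t${prefix}-functions.txt"
done >> fasta.txt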
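
Because the workflow depends entirely on `fasta.txt`, a quick sanity check of the table can save a failed run. The sketch below assumes the four-column, tab-separated layout written above; `check_fasta_txt` is a hypothetical helper, not an Anvi'o command.

```shell
# Hypothetical helper: check that a fasta.txt-style table is well formed.
# Every data row must have exactly four tab-separated fields, and the three
# file paths (columns 2-4) must point to existing, non-empty files.
check_fasta_txt() {
    local table=$1
    local status=0
    local line_no=0
    local tab name path calls funcs extra f
    tab=$(printf '\t')
    while IFS="$tab" read -r name path calls funcs extra; do
        line_no=$((line_no + 1))
        # Skip the header line
        if [ "$line_no" -eq 1 ]; then
            continue
        fi
        if [ -z "$funcs" ] || [ -n "$extra" ]; then
            echo "malformed row: ${name}"
            status=1
            continue
        fi
        for f in "$path" "$calls" "$funcs"; do
            if [ ! -s "$f" ]; then
                echo "missing file for ${name}: ${f}"
                status=1
            fi
        done
    done < "$table"
    return "$status"
}
```

Calling `check_fasta_txt fasta.txt` prints one line per problem and returns a non-zero status if any row is malformed or points to a missing file.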

In [None]:
%%bash

# Run the Anvi'o pangenomics workflow with the settings in config.json
anvi-run-workflow -w pangenomics -c config.json
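
The cell above assumes that `config.json` already exists; its contents are not shown in this notebook. One way to obtain a starting point is to ask Anvi'o for the default configuration of the workflow and edit it by hand. This is a sketch only: the output file name `default-config.json` is a hypothetical choice made here so that an existing `config.json` is not overwritten, and the command is skipped when Anvi'o is not on the `PATH`.

```shell
# Hypothetical file name, chosen so that an existing config.json is not overwritten
default_config=default-config.json

if command -v anvi-run-workflow >/dev/null 2>&1; then
    # Write the default settings of the pangenomics workflow to a JSON file
    anvi-run-workflow -w pangenomics --get-default-config "$default_config"
    echo "Edit ${default_config} (for example the fasta_txt and project_name entries)"
else
    echo "anvi-run-workflow not found; activate the Anvi'o environment first"
fi
```

After editing, adding `--dry-run` to the `anvi-run-workflow` call should list the planned steps without executing them, which is a cheap way to catch path errors.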

In [None]:
%%bash

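# Replace YOUR_PORT_NUMBER with a free port on the machine running Anvi'o
ANVIPORT=YOUR_PORT_NUMBER

# Start the interactive pangenome display without opening a browser
anvi-display-pan \
    -g 03_PAN/*-GENOMES.db \
    -p 03_PAN/*-PAN.db \
    --server-only -P "$ANVIPORT"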