# Read Assembly and Annotation Pipeline Tool (RAPT)

This Jupyter notebook is intended to show you how to run RAPT on your own machine. Jupyter itself is not required; we use it merely for the convenience of interleaving documentation with the shell commands.

## Requirements

To run the PGAP pipeline you will need:
* Python, version 3.5 or higher (the f-strings used in the code cells of this notebook need 3.6 or higher);
* the ability to run Docker (see https://docs.docker.com/install/ if it is not already installed);
* about 100GB of storage for the supplemental data and working space;
* 2GB-4GB of memory available per CPU used by your container;
* a CPU with SSE 4.2 support (released in 2008).

Debian 10 is currently not supported.
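
Some of the requirements above can be checked before anything is downloaded. The following is a minimal sketch, not part of RAPT or PGAP: the helper names `has_sse42` and `preflight` are made up for this example, the CPU check assumes a Linux host that exposes `/proc/cpuinfo`, and the Python threshold is set to 3.6 because the later cells use f-strings.

```python
import shutil
import sys


def has_sse42(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists sse4_2."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and "sse4_2" in line.split():
            return True
    return False


def preflight(path=".", min_free_gb=100):
    """Collect a few requirement checks into a dict of description -> bool."""
    free_gb = shutil.disk_usage(path).free / 1e9
    results = {
        "python>=3.6": sys.version_info >= (3, 6),
        "docker on PATH": shutil.which("docker") is not None,
        "{} GB free".format(min_free_gb): free_gb >= min_free_gb,
    }
    try:
        with open("/proc/cpuinfo") as handle:  # assumption: Linux host
            results["SSE 4.2"] = has_sse42(handle.read())
    except OSError:
        pass  # no /proc/cpuinfo here; skip the CPU check rather than guess
    return results


for name, ok in preflight().items():
    print("{}: {}".format(name, "ok" if ok else "MISSING"))
```

Finding `docker` on the `PATH` does not prove that the daemon is running or that your user may talk to it, so treat a passing result as a first filter only.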

## Installation

There are two main components to this pipeline: the SKESA assembler and the PGAP annotator.

1. Install SKESA using the Docker image.
1. Install PGAP using its control software, `pgap.py`.

In [None]:
%%time
!docker pull ncbi/skesa:v2.3.0

In [None]:
%%time
!curl -OL https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
!chmod +x pgap.py
!./pgap.py --taxcheck --update > pgap_update.log

## Running the pipelines

The pipeline can run on data from NCBI's Sequence Read Archive (SRA) or on FASTA/FASTQ files that you provide. The following example uses an SRA run, in which case the data is downloaded automatically.

First we set the parameters:

In [None]:
SRR = "SRR4835107"
genus_species = "Salmonella enterica"
topology = "linear"
bioproject = "PRJNA316728"
biosample = "SAMN04160831"
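
A typo in one of these parameters only surfaces much later, after long downloads, so a quick sanity check can save time. The sketch below is an illustration only: the regular expressions are assumed accession shapes rather than an official NCBI grammar, the set of allowed topologies is an assumption, and `check_parameters` is a hypothetical helper. The values are copied from the cell above so that the example runs on its own.

```python
import re

# Assumed accession shapes, for illustration only.
PATTERNS = {
    "SRR": r"[SED]RR\d+",
    "bioproject": r"PRJ[A-Z]{2}\d+",
    "biosample": r"SAM[A-Z]{1,2}\d+",
}
ALLOWED_TOPOLOGIES = {"linear", "circular"}  # assumption


def check_parameters(params):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for key, pattern in PATTERNS.items():
        value = params.get(key, "")
        if not re.fullmatch(pattern, value):
            problems.append("{}={!r} does not look like an accession".format(key, value))
    if params.get("topology") not in ALLOWED_TOPOLOGIES:
        problems.append("topology={!r} is not recognised".format(params.get("topology")))
    if not params.get("genus_species", "").strip():
        problems.append("genus_species is empty")
    return problems


example = {
    "SRR": "SRR4835107",
    "genus_species": "Salmonella enterica",
    "topology": "linear",
    "bioproject": "PRJNA316728",
    "biosample": "SAMN04160831",
}
print(check_parameters(example))  # prints []
```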

## Run SKESA

In [None]:
%%time
!docker run --rm ncbi/skesa:v2.3.0 skesa --sra_run {SRR} > {SRR}.skesa.fa
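
Because the assembly is captured with a shell redirect, the output file is created even when the container fails, so it is worth confirming that it actually contains contigs before moving on. The following is a rough sketch using only the standard library; `summarize_fasta` is a hypothetical helper, and the demo runs on a tiny stand-in file rather than real SKESA output.

```python
import os
import tempfile


def summarize_fasta(path):
    """Return (number of records, total sequence length) for a FASTA file."""
    n_records = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_records += 1  # each header line starts a new record
            else:
                total_bases += len(line)
    return n_records, total_bases


# Demo on a stand-in file; in this notebook you would pass f"{SRR}.skesa.fa".
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa")
    with open(demo, "w") as handle:
        handle.write(">contig_1\nACGTAC\n>contig_2\nGGCCTA\n")
    n_contigs, n_bases = summarize_fasta(demo)
    print("{} contigs, {} bases".format(n_contigs, n_bases))  # 2 contigs, 12 bases
```

A result of `(0, 0)` would suggest that the assembly step did not produce anything and that its console output deserves a closer look.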

## Create YAML Input files

To run PGAP, we must first create two YAML files that describe the input. The following Python code does this, based on the variables set above. You don't need to use it; any text editor can create the files as well. The documentation for the files may be found at <https://github.com/ncbi/pgap/wiki/Input-Files>.

In [None]:
input_data = f'''
fasta:
  class: File
  location: {SRR}.skesa.fa
submol:
  class: File
  location: {SRR}_submol.yaml
report_usage: True
'''

submol_data = f'''
topology: {topology}
organism:
    genus_species: '{genus_species}'
    strain: 'replaceme'
contact_info:
    last_name: 'Doe'
    first_name: 'Jane'
    email: 'jane_doe@gmail.com'
    organization: 'Institute of Klebsiella foobarensis research'
    department: 'Department of Using NCBI'
    phone: '301-555-0245'
    street: '1234 Main St'
    city: 'Docker'
    postal_code: '12345'
    country: 'Lappland'
    
authors:
    - author:
        first_name: 'Arnold'
        last_name: 'Schwarzenegger'
        middle_initial: 'T'
    - author:
        first_name: 'Linda'
        last_name: 'Hamilton'
bioproject: '{bioproject}'
biosample: '{biosample}'
# -- Locus tag prefix - optional. Limited to 9 letters.
# Unless the locus tag prefix was officially assigned by NCBI, ENA, or DDBJ,
# it will be replaced upon submission of the annotation to NCBI and is
# therefore temporary and not to be used in publications.
# If not provided, pgaptmp will be used.
locus_tag_prefix: 'tmp'
publications:
    - publication:
        pmid: 16397293
        title: 'Discrete CHARMm of Klebsiella foobarensis. Journal of Improbable Results, vol. 34, issue 13, pages: 10001-100005, 2018'
        status: published  # this is enum: controlled vocabulary
        authors:
            - author:
                first_name: 'Arnold'
                last_name: 'Schwarzenegger'
                middle_initial: 'T'
            - author:
                first_name: 'Linda'
                last_name: 'Hamilton'
'''

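# Write both YAML documents to disk. The context managers close the files
# automatically, and the names avoid shadowing the built-in `input`.
with open(f'{SRR}_input.yaml', 'w') as input_file:
    input_file.write(input_data)

with open(f'{SRR}_submol.yaml', 'w') as submol_file:
    submol_file.write(submol_data)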
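
The input YAML points at two other files through its `location:` entries, so a missing or misnamed file is an easy mistake to make. Here is a minimal sketch of a check for that. It is not a YAML parser: it only scans for `location:` lines with a regular expression. It also assumes that relative paths are resolved against the directory holding the YAML file, and `missing_locations` is a hypothetical helper name.

```python
import os
import re
import tempfile


def missing_locations(yaml_path):
    """Return the `location:` paths named in a YAML file that do not exist."""
    base = os.path.dirname(os.path.abspath(yaml_path))
    with open(yaml_path) as handle:
        text = handle.read()
    paths = re.findall(r"^[ \t]*location:[ \t]*(\S+)", text, flags=re.MULTILINE)
    # Assumption: relative locations are relative to the YAML file itself.
    return [p for p in paths if not os.path.exists(os.path.join(base, p))]


# Demo with stand-in files; in this notebook you would pass f"{SRR}_input.yaml".
with tempfile.TemporaryDirectory() as tmp:
    yaml_path = os.path.join(tmp, "demo_input.yaml")
    with open(yaml_path, "w") as handle:
        handle.write(
            "fasta:\n  class: File\n  location: demo.fa\n"
            "submol:\n  class: File\n  location: demo_submol.yaml\n"
        )
    open(os.path.join(tmp, "demo.fa"), "w").close()  # only one of the two exists
    demo_missing = missing_locations(yaml_path)
    print(demo_missing)  # ['demo_submol.yaml']
```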

## Run PGAP

We run PGAP using the previously downloaded `pgap.py` utility. The optional `--taxcheck` flag additionally checks the taxon by comparing the Average Nucleotide Identity (ANI) of the assembly to type assemblies. This is the same process described in <https://github.com/ncbi/pgap/wiki/Quick-Start>.

In [None]:
%%time
!./pgap.py --taxcheck -o {SRR}_results {SRR}_input.yaml
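
Once the run finishes, the directory given with `-o` holds the results. The sketch below simply lists whatever files are present along with their sizes; it makes no assumption about which files PGAP writes, and `list_outputs` is a hypothetical helper. The demo uses a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path


def list_outputs(results_dir):
    """Return (file name, size in bytes) pairs for a directory, sorted by name."""
    return sorted(
        (entry.name, entry.stat().st_size)
        for entry in Path(results_dir).iterdir()
        if entry.is_file()
    )


# Demo on a stand-in directory; in this notebook you would pass f"{SRR}_results".
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.txt").write_text("hello")
    demo_listing = list_outputs(tmp)
    for name, size in demo_listing:
        print("{:>12,d}  {}".format(size, name))
```

Zero-byte files in the listing are a useful hint that a step may have failed part-way through.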