# Read Assembly and Annotation Pipeline Tool (RAPT)

This Jupyter notebook shows how to run RAPT on your own machine. Jupyter is not required; we use it only for the convenience of interleaving documentation with the shell commands.

## Requirements

To run the PGAP pipeline you will need:
* Python (version 3.6 or higher),
* the ability to run Docker (see https://docs.docker.com/install/ if it is not already installed),
* about 100GB of storage for the supplemental data and working space,
* 2GB-4GB of memory available per CPU used by your container,
* a CPU with SSE 4.2 support (released in 2008).

Debian 10 is currently not supported.
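
The requirements above can be checked from a shell before installing anything. The following is a minimal sketch, not part of RAPT: the helper `version_ge` is a hypothetical name, it relies on GNU `sort -V`, and the SSE 4.2 check assumes a Linux host where `/proc/cpuinfo` is available.

```shell
# Return success when version $1 is at least version $2 (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Python 3.6 or higher
if command -v python3 >/dev/null 2>&1; then
    py_version=$(python3 -c 'import platform; print(platform.python_version())')
    if version_ge "$py_version" 3.6; then
        echo "python3 $py_version: ok"
    else
        echo "python3 $py_version: too old, need 3.6 or higher"
    fi
else
    echo "python3: not found"
fi

# Docker client installed and daemon reachable
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    echo "docker: ok"
else
    echo "docker: not found or daemon not reachable"
fi

# SSE 4.2 CPU flag (Linux only)
if [ -r /proc/cpuinfo ]; then
    if grep -q 'sse4_2' /proc/cpuinfo; then
        echo "sse4_2: ok"
    else
        echo "sse4_2: flag not found"
    fi
else
    echo "sse4_2: cannot check (no /proc/cpuinfo)"
fi

# Free space in the current directory, in whole GB
free_gb=$(df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }')
echo "free space here: ${free_gb} GB (about 100 GB recommended)"
```

The script only reports what it finds and never aborts, so it is safe to run on a machine that is missing some of the prerequisites.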

## Installation (*skip if using the preconfigured VM*)

This pipeline has two main components: the SKESA assembler and the PGAP annotator. Installation therefore takes two steps:

1. Install SKESA using the docker image
1. Install PGAP using its control software, `pgap.py`

In [None]:
time docker pull ncbi/skesa:v2.3.0

In [None]:
curl -OL https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
chmod +x pgap.py
time ./pgap.py --taxcheck --update

## Running the pipelines

The pipeline can be run on data from NCBI's SRA, or on your own FASTA/FASTQ files. The following example uses a run from SRA, in which case the data is downloaded automatically.

First we set the parameters:

In [None]:
export SRR="SRR4835107"
export GENUS_SPECIES="Salmonella enterica"
export TOPOLOGY="linear"
export BIOPROJECT="PRJNA316728"
export BIOSAMPLE="SAMN04160831"
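
A typo in one of these parameters only surfaces much later, so a quick format check can save time. The sketch below is an illustration, not something RAPT provides: `matches` and `check_param` are hypothetical helpers, and the patterns are assumptions about the usual shape of SRA run, BioProject and BioSample accessions rather than an official validation.

```shell
# Return success when $1 matches the extended regular expression $2 in full.
matches() {
    printf '%s\n' "$1" | grep -Eq "^($2)\$"
}

# $1: label, $2: value, $3: pattern the value is expected to match
check_param() {
    if matches "$2" "$3"; then
        echo "$1=$2: looks plausible"
    else
        echo "$1=$2: unexpected format"
    fi
}

# Fall back to the example values when the variables are not set.
check_param SRR        "${SRR:-SRR4835107}"          '[SED]RR[0-9]+'
check_param TOPOLOGY   "${TOPOLOGY:-linear}"         'linear|circular'
check_param BIOPROJECT "${BIOPROJECT:-PRJNA316728}"  'PRJ[NED][A-Z][0-9]+'
check_param BIOSAMPLE  "${BIOSAMPLE:-SAMN04160831}"  'SAM(N|EA|D)[0-9]+'
```

The check only looks at the format of each value; it does not confirm that the accession exists at NCBI.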

## Run SKESA

Here are the important options for running SKESA. Use `--reads` to specify your own input file(s), and `--sra_run` to specify SRA run accessions, which will be downloaded from SRA automatically.

```
Input/output options: at least one input providing reads for assembly must be specified:
  --reads arg                   Input fasta/fastq file(s) for reads (could be 
                                used multiple times for different runs, could 
                                be gzipped) [string]
  --use_paired_ends             Indicates that a single (not comma separated) 
                                fasta/fastq file contains paired reads [flag]
  --sra_run arg                 Input sra run accession (could be used multiple
                                times for different runs) [string]
```
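
When using `--reads` with the Docker image, the read files have to be visible inside the container. One way to do that is a bind mount. The sketch below only prints the command instead of running it. The mount point `/data`, the helper name `skesa_local_cmd` and the file names are assumptions made for illustration, and the comma between the two files assumes paired reads split across two files.

```shell
# Print a docker command that assembles two paired read files from the
# current directory. /data is an arbitrary mount point chosen for this sketch.
skesa_local_cmd() {
    printf 'docker run --rm -v %s:/data ncbi/skesa:v2.3.0 skesa --reads /data/%s,/data/%s\n' \
        "$PWD" "$1" "$2"
}

# Hypothetical file names; replace them with your own reads.
skesa_local_cmd sample_1.fastq.gz sample_2.fastq.gz
```

To run the printed command, redirect its standard output to a FASTA file, as in the SRA example in this notebook.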



In [None]:
echo $SRR
time docker run --rm ncbi/skesa:v2.3.0 skesa --sra_run $SRR > ${SRR}.skesa.fa
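
Before moving on, it can be worth confirming that the assembly file is not empty. The following sketch counts contigs and bases in a FASTA file with `awk`. `fasta_stats` is a hypothetical helper, and the tiny FASTA file is made-up demonstration data, not SKESA output.

```shell
# Print the number of sequences and the total number of bases in FASTA file $1.
fasta_stats() {
    awk '
        /^>/ { contigs++; next }        # header line: one more sequence
        { bases += length($0) }         # sequence line: add its length
        END { printf "contigs=%d bases=%d\n", contigs, bases }
    ' "$1"
}

# Made-up two-contig example to show the output format.
demo_fa=$(mktemp)
printf '>Contig_1\nACGTACGT\nACGT\n>Contig_2\nGGCC\n' > "$demo_fa"
fasta_stats "$demo_fa"    # prints: contigs=2 bases=16
```

For the real assembly, call `fasta_stats ${SRR}.skesa.fa`. A result of `contigs=0` suggests that the SKESA step failed.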

## Create YAML Input files

To run PGAP, we must first create two YAML files to describe the input. The following shell commands do this, based upon the variables set above. You don't need to use them; any text editor can create the files as well. The documentation for the files may be found at <https://github.com/ncbi/pgap/wiki/Input-Files>.

In [None]:
cat << EOF_input > ${SRR}_input.yaml
fasta:
    class: File
    location: ${SRR}.skesa.fa
submol:
    class: File
    location: ${SRR}_submol.yaml
report_usage: True
EOF_input

echo "Created ${SRR}_input.yaml"

cat << EOF_submol > ${SRR}_submol.yaml
topology: ${TOPOLOGY}
organism:
    genus_species: '${GENUS_SPECIES}'
    strain: 'replaceme'
contact_info:
    last_name: 'Doe'
    first_name: 'Jane'
    email: 'jane_doe@gmail.com'
    organization: 'Institute of Klebsiella foobarensis research'
    department: 'Department of Using NCBI'
    phone: '301-555-0245'
    street: '1234 Main St'
    city: 'Docker'
    postal_code: '12345'
    country: 'Lappland'
    
authors:
    - author:
        first_name: 'Arnold'
        last_name: 'Schwarzenegger'
        middle_initial: 'T'
    - author:
        first_name: 'Linda'
        last_name: 'Hamilton'
bioproject: '${BIOPROJECT}'
biosample: '${BIOSAMPLE}'      
# -- Locus tag prefix - optional. Limited to 9 letters. Unless the locus tag prefix was officially assigned by NCBI, ENA, or DDBJ, it will be replaced upon submission of the annotation to NCBI and is therefore temporary and not to be used in publications. If not provided, pgaptmp will be used.
locus_tag_prefix: 'tmp'
publications:
    - publication:
        pmid: 16397293
        title: 'Discrete CHARMm of Klebsiella foobarensis. Journal of Improbable Results, vol. 34, issue 13, pages: 10001-100005, 2018'
        status: published  # this is enum: controlled vocabulary
        authors:
            - author:
                first_name: 'Arnold'
                last_name: 'Schwarzenegger'
                middle_initial: 'T'
            - author:
                first_name: 'Linda'
                last_name: 'Hamilton'
EOF_submol

echo "Created ${SRR}_submol.yaml"
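
The generated files contain placeholder values such as `strain: 'replaceme'`, and a variable that was never exported expands to an empty string. A small check can list such lines before PGAP is started. This is a sketch only: `find_unfilled` is a hypothetical helper and the demonstration file is made up.

```shell
# List lines of file $1 that still contain the placeholder text or an empty
# quoted value. Succeeds (exit status 0) when at least one such line exists.
find_unfilled() {
    grep -nE "replaceme|: *'' *\$" "$1"
}

# Made-up fragment with one empty value and one placeholder.
demo_yaml=$(mktemp)
printf '%s\n' \
    "topology: linear" \
    "organism:" \
    "    genus_species: ''" \
    "    strain: 'replaceme'" > "$demo_yaml"

if find_unfilled "$demo_yaml"; then
    echo "edit the lines listed above"
else
    echo "no placeholders found"
fi
```

For the real input, call `find_unfilled ${SRR}_submol.yaml`.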

## Run PGAP

We run PGAP using the previously downloaded `pgap.py` utility. The optional `--taxcheck` feature also checks the taxon by comparing the Average Nucleotide Identity to type assemblies. Note that this is the same process described in <https://github.com/ncbi/pgap/wiki/Quick-Start>.

In [None]:
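# Start PGAP in the background and remember its process ID.
nohup ./pgap.py --taxcheck -o ${SRR}_results ${SRR}_input.yaml &
PID=$!
LOGFILE=${SRR}_results/cwltool.log
# Wait for the log file to appear, then follow it until PGAP exits.
until [ -f "$LOGFILE" ] ; do sleep 2 ; done
tail --pid=$PID -f "$LOGFILE"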
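
The cell above combines three steps: start a job in the background, wait for its log file to appear, and follow the log until the job ends. The sketch below shows the same pattern in isolation. A short subshell stands in for `pgap.py` (an assumption made purely for illustration), and `tail --pid` requires GNU coreutils.

```shell
workdir=$(mktemp -d)
log="$workdir/demo.log"

# Hypothetical stand-in for the pipeline: it creates its log after a delay,
# appends one more line, then exits.
( sleep 1; echo "step 1" > "$log"; sleep 1; echo "step 2" >> "$log" ) &
job_pid=$!

# Poll until the log file exists, since tail cannot follow a missing file.
until [ -f "$log" ]; do sleep 1; done

# Follow the log and stop once the background job has exited.
tail --pid="$job_pid" -f "$log"

# Collect the exit status of the background job.
wait "$job_pid"
echo "job exit status: $?"
```

Note that the polling loop never ends if the job dies before it creates the log file, so check the `nohup.out` file if nothing appears after a while.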

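## Cleanup

The commands below delete the downloaded and generated files and prune all unused Docker data. They are commented out because you probably don't want to run them.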

In [None]:
#rm -rf input* pgap_input* SRR* test_genomes* VERSION pgap.py
#docker system prune -a -f