
Does Docker need to be installed on my machine?

Not necessarily. You need Docker or Docker-compatible software, such as Singularity or Podman. By default the pipeline runs in a Docker container. If you wish to use a compatible container technology instead, provide its path to pgap.py with the parameter --docker <path>. If you wish to execute with Docker, follow the installation instructions at https://docs.docker.com/install/ for your distribution. Once Docker is installed, please ensure you can successfully execute the following command to test your configuration:

$ docker run hello-world 
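
For example, to run with Podman instead of Docker (a sketch only: the Podman path is a placeholder and .... stands for your usual pgap.py arguments):

$ ./pgap.py --docker /usr/bin/podman ....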

What are the runtime resource requirements?

The required resources vary with the size and complexity of the input genome. We typically run on instances with 8 CPUs and 32 GB of memory to be safe, but you can get away with only 8 GB for smaller genomes. You may have to decrease the number of CPUs to fit your available memory; 2 GB to 4 GB per CPU is recommended.
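
On Linux, standard utilities can show what the host has available before you pick the number of CPUs (a quick sketch):

$ nproc      # number of available CPUs
$ free -h    # total and available memory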

Can I run under macOS or Windows?

Although Linux is our primary development platform, we have run under other operating systems using pgap.py. You will still need Python 3.6 or greater, and Docker. Ensure that Docker is running with Linux container support enabled (see this blog post for more information). Note that on both Windows and macOS you may need to increase the default amount of memory available to Docker containers. Please see the runtime resource requirements above.

Do I need network access?

No. To run PGAP without accessing the network, make sure to set the flag --no-internet of pgap.py. It will disable internet access for all programs in the pipeline.

You will still need internet access to install PGAP. One way to do this is to install PGAP in a shared network location from an internet-connected host, and then use it on an air-gapped host by setting the PGAP_INPUT_DIR environment variable:

export PGAP_INPUT_DIR=/network/accessible/directory
./pgap.py  --no-internet --no-self-update  ....

Will it work on any CPU?

No. The CPU must implement an Intel-compatible instruction set architecture that supports SSE4.2 or later. This includes most processors released after 2008 (see https://en.wikipedia.org/wiki/SSE4#SSE4.2).
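
On Linux, one quick way to check for SSE4.2 support (a non-empty result means the flag is present):

$ grep -m1 -o sse4_2 /proc/cpuinfo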

Can I run PGAP in distributed compute clusters (UGE/SGE, SLURM, Biowulf)?

Maybe. While nothing in the software intentionally prevents use on a cluster, we cannot provide assistance for this use case, given the additional complexity. Feel free to try, and tell us about your experience. Be aware that internet access is usually unavailable in clustered environments, so you may need to turn on the --no-internet option of pgap.py (see above).
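
If you want to experiment anyway, a hypothetical SLURM batch script might look like the sketch below. It is untested and illustrative only; the job parameters and paths are placeholders.

#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=12:00:00
export PGAP_INPUT_DIR=/network/accessible/directory
./pgap.py --no-internet --no-self-update ....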

Changing --cpu and --memory has no effect. What can I do?

This is mostly fixed in release 2021-01-11.build5132. The value provided to the --cpu parameter is now correctly passed to the container, whether Docker, Podman or Singularity. Memory limits (provided with --memory) are now supported for Docker and Podman, but not for Singularity.
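
With that release or later, the limits can be passed on the command line, as in the sketch below (the memory value format shown is an assumption, so check pgap.py --help for the accepted syntax; .... stands for your usual arguments):

./pgap.py --cpu 8 --memory 32g ....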

Do you provide PGAP as Singularity images?

No. Unfortunately, we cannot provide direct Singularity support at this time, due to lack of expertise and competing priorities. We are able to host files (on FTP/HTTP, AWS S3, and GCS storage) and host Docker images on DockerHub. If the community is able to provide us instructions for supporting Singularity images under that constraint, we will be happy to follow up.

How long does PGAP take to run?

It largely depends on the size of the genome and the number of CPUs. In our hands, the Mycoplasma genitalium genome (0.58 Mb) distributed with the software takes about 1.5 hours, and a 5.7 Mb E. coli genome takes about 6.75 hours on an AWS m5.2xlarge instance (8 CPUs and 32 GB of RAM).

The annotation fails. What can I try?

If you haven't yet, test your installation on the Mycoplasmoides genitalium genome distributed with the software (MG37) to verify that your platform is configured correctly. If this test doesn't succeed, try a fresh reinstallation. Second, too little memory can lead to sporadic failures, so if you can, try running on a machine with more memory per CPU. As specified above, a minimum of 2 GB of memory per CPU is recommended and will be sufficient for most genomes, but increasing the memory to 4 or 6 GB per CPU may circumvent failures.
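
A sketch of such a test run, assuming -r enables usage reporting and -o sets the output directory, with the MG37 input path as a placeholder (the exact command for the supplied test genome is given in the installation instructions):

./pgap.py -r -o mg37_results path/to/MG37/input.yaml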

Why does my run occasionally not finish, producing no logs or messages in the terminal, even though the pipeline still seems to be running?

You are most likely running the pipeline on a remote machine over ssh, and the connection has been interrupted. When working on a remote machine, use the nohup utility or a terminal multiplexer such as tmux or screen, so that pgap.py can continue if the ssh connection is interrupted.
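
For example, with nohup the run survives a dropped connection and its output is captured in a log file (.... stands for your usual pgap.py arguments):

nohup ./pgap.py .... > pgap.log 2>&1 &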

I am not confident in the taxonomic classification of the organism I sequenced, so the scientific name I can provide is only a guess. Is it acceptable?

Yes. Use the flags --taxcheck and --auto-correct-tax, so that the assembly is assigned to an organism before annotation begins. With --taxcheck, ANI (Average Nucleotide Identity) identifies the best-matching assembly of well-defined origin in GenBank. The scientific name you provided on input will be overridden by the scientific name determined by ANI, resulting in a more accurate annotation. The scientific name in the final results is the ANI-chosen name.
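
A sketch of such an invocation (.... stands for the rest of your usual pgap.py arguments):

./pgap.py --taxcheck --auto-correct-tax ....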

Can I run PGAP on a metagenomic sample?

No. PGAP runs on a single genome at a time. It uses the genus of the organism provided on input by the user to determine sets of proteins to align to the genome for gene prediction. The user is therefore required to associate a genus- or species-level organism name with the input FASTA.
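
For reference, the organism name is supplied through the YAML files passed to pgap.py. A minimal sketch, modeled on the MG37 example distributed with the software; the file names are placeholders and the field layout should be checked against the PGAP input documentation:

input.yaml:
    fasta:
        class: File
        location: my_genome.fasta
    submol:
        class: File
        location: my_submol.yaml

my_submol.yaml:
    organism:
        genus_species: 'Escherichia coli'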

What if PGAP fails in the validation of the input FASTA?

PGAP validates input FASTA files and fails by default if a genome contains vector or adaptor contamination, or is smaller or larger than expected for the species. For organisms for which no size range is defined, the minimum and maximum sizes allowed for the input genome are 15 kb and 100 Mb, respectively. You can choose to ignore the validation errors by setting the flag --ignore-all-errors in pgap.py. Keep in mind that the annotation obtained with the --ignore-all-errors flag may not comply with GenBank's standards of quality.
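
To proceed despite validation errors (.... stands for your usual pgap.py arguments):

./pgap.py --ignore-all-errors ....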

What information is reported to NCBI when I turn on the report usage flag (-r or --report-usage-true)?

For each run of the pipeline, two reports are generated: one at the beginning and one at the end. These reports help us measure our impact on the community, which in turn helps us secure funding, so please report your usage. We collect:

  1. Date and time.
  2. A randomly generated UUID for each run.
  3. IP address.
  4. Pipeline version.

I need help diagnosing a failure. What files do I need to provide?

Please run PGAP with the --debug flag, open an issue, and attach an archive (e.g. zip or tarball) of the log files matching debug/tmp-outdir/*/*.log
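
A sketch of collecting the logs after a failed run with --debug, assuming the debug directory is under your working directory; the archive name and the other arguments (....) are placeholders:

./pgap.py --debug ....
tar czf pgap-logs.tar.gz debug/tmp-outdir/*/*.log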