ave-rest-service


The Allelic Variation Explorer (AVE) is a web application to visualize (clustered) single-nucleotide variants across genomes. The Allelic Variation Explorer REST service clusters genomic variants and lists the available datasets. Combined with the ave-app front end, it visualizes clustered genomic variants for a given genomic range in a genome browser.

Screenshot of Allelic Variation Explorer

This service is the back end for the ave-app front end. The front end runs in the user's web browser and communicates with the back end running on a web server. The front end is the user interface; the back end is the service serving the variant, annotation and genomic data. The front end and back end communicate with each other according to the Swagger specification in swagger.yml.

Architecture

Architecture diagram

The ave-rest-service and ave-app are wrapped up in a Docker image. The Docker image is used for deploying the Allelic Variation Explorer on a server.

The Allelic Variation Explorer consists of the following parts working together:

  • a running ave rest service
  • an extracted ave-app build archive
  • 2bit (genome sequence), bcf (variants) and bigbed (genes and feature annotations) data files (green in the diagram)
  • a directory with full text indices for genes and features in Whoosh format, filled by the data registration commands (red in the diagram)
  • a meta database file containing the list of datasets available inside the application, filled by the data registration commands (yellow in the diagram)
  • an NGINX web server, for hosting the app and data files and proxying the ave rest service behind a single port
  • a Docker image combining all of the above; see ./Dockerfile for the instructions used to install all the parts

Deployment

A Docker image is available on Docker Hub.

Any change to the master branch of this repo or the ave-app will trigger an automatic build of the Docker image.

The Docker image contains no data, and data added inside the container is lost when the Docker container is removed. To get a deployment that persists its data, create directories on the server and mount them as volumes in the Docker container. The container expects the following volumes:

  • /data, location for 2bit, bcf and bigbed data files. Hosted as http://<aveserver>/data
  • /whoosh, full text indices for genes and features
  • /meta, directory in which ave meta database is stored

Run the service with

# Use sub directories in the current working directory to persist data
mkdir data
mkdir whoosh
mkdir meta
docker run -d \
  -v $PWD/data:/data -v $PWD/whoosh:/whoosh -v $PWD/meta:/meta \
  -p 80:80 \
  --name ave ave2/allelic-variation-explorer

The command above runs the web server on port 80 of the host machine.

After deployment the server is running but contains no data. See the Data pre processing chapter for how to prepare data, followed by the Data registration chapter for how to add it.

Demo

A demo Docker image with a sample dataset is available at https://hub.docker.com/r/ave2/ave-demo/ .

Secure connection

The Docker container uses HTTP, which is an unencrypted connection. To use a secure connection (HTTPS), a reverse proxy with a Let's Encrypt certificate can be put in front of the Docker container.

The Docker container must be run on a port other than 443.

Configure a web server like NGINX to serve HTTPS on port 443 and proxy all requests to the container. Use Certbot to generate the certificate pair and configure the web server.

See the example server configuration in the commented-out block in the ./nginx.conf file.
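A minimal sketch of such a server block, assuming the container is published on 127.0.0.1:8080 and Certbot has already placed the certificate pair under /etc/letsencrypt. The host name and all paths are placeholders; the commented-out block in ./nginx.conf is the authoritative example.

```nginx
# Hypothetical HTTPS reverse proxy in front of the AVE container.
server {
    listen 443 ssl;
    server_name ave.example.org;                       # placeholder host name
    ssl_certificate     /etc/letsencrypt/live/ave.example.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ave.example.org/privkey.pem;
    location / {
        proxy_pass http://127.0.0.1:8080;              # container published on a non-443 port
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```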

Shutting down

The Docker container can be stopped and removed using

docker rm -f ave

Update image

Make sure the Docker container is not running.

The Docker image can be updated using

docker pull ave2/allelic-variation-explorer

Data pre processing

Before data can be registered it has to be converted into the right format. The sections below describe the pre processing steps to convert common formats into the formats the application expects.

The tools used are available inside the Docker container or can be installed in an Anaconda environment. To perform the pre processing inside the Docker container, copy the raw files to the /data Docker volume and log in to the Docker container with docker exec -ti ave bash.

Genome sequence

Genome sequence in 2bit format is used.

If you have a genome sequence in FASTA format, with each chromosome as a separate sequence in the file, the FASTA file can be converted to 2bit using:

faToTwoBit genome.fa genome.2bit

The sequence identifiers (chromosome/contig) should match the ones in the corresponding variants (bcf), genes (bigbed) and genomic feature annotation (bigbed) files.
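As a toy illustration of this check (the file names and sequence names below are made up; with real files, `twoBitInfo genome.2bit stdout | cut -f1` lists the sequence names in a 2bit file):

```shell
# Toy check: the name lists derived from the two files must be identical.
printf 'SL2.40ch01\nSL2.40ch02\n' > twobit_names.txt   # names as reported for the 2bit file
printf 'SL2.40ch01\nSL2.40ch02\n' > bcf_names.txt      # names as used in the BCF file
diff twobit_names.txt bcf_names.txt && echo 'identifiers match'
# prints: identifiers match
```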

Variants

Variants need to be provided in a single BCF formatted file.

VCF files can be converted to a BCF file in the following way:

  1. Merge the VCF files with SAMtools and VCFtools
# vcf-merge requires bgzipped and tabix-indexed VCF files, so do that first
for f in *.vcf
do
bgzip $f
tabix -p vcf $f.gz
done
vcf-merge *.vcf.gz > variants.vcf
  2. Sort by chromosome with VCFtools
vcf-sort -c variants.vcf > variants.sorted.vcf
  3. Compress with bgzip and index with tabix
bgzip variants.sorted.vcf
tabix -p vcf variants.sorted.vcf.gz
  4. Convert to BCF with bcftools
bcftools view -O b variants.sorted.vcf.gz > variants.sorted.bcf
  5. Index with bcftools
bcftools index variants.sorted.bcf

Genes

The genes (or transcripts) are rendered in a gene track.

Genes must be provided in bigBed format. A bigBed file can be converted from a BED formatted file.

The gene bed file is expected to have the following columns:

  1. chrom, name of chromosome
  2. chromStart, start of gene, zero-indexed
  3. chromEnd, end of gene
  4. name, transcript identifier
  5. score
  6. strand
  7. thickStart, location of start codon
  8. thickEnd, location of stop codon
  9. itemRgb
  10. blockCount, number of exons
  11. blockSizes
  12. blockStarts
  13. gene identifier
  14. description of gene

So each line should contain the exons and start/stop codons for one gene.
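As a made-up example, the line below encodes a hypothetical three-exon gene with all 14 tab-separated columns (all coordinates and identifiers are invented):

```shell
# Write one hypothetical bed12+2 gene line; \t produces the tab separators.
printf 'SL2.40ch01\t1000\t5000\tSolyc01g000001.1.1\t0\t+\t1200\t4800\t0\t3\t500,400,600,\t0,1500,3400,\tSolyc01g000001\thypothetical protein\n' > example_gene.bed
# Count the tab-separated fields; a valid line has exactly 14:
awk -F'\t' '{print NF}' example_gene.bed
# → 14
```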

Some gene bed files are available at http://bioviz.org/quickload/

To convert a gene bed file to bigbed use:

# Fetch chrom sizes
twoBitInfo genome.2bit chrom.sizes
# the macOS version of bedToBigBed is also available at http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/bedToBigBed
chmod +x bedToBigBed
# description field is too long for bedToBigBed so it must be trimmed
gunzip -c S_lycopersicum_May_2012.bed.gz | perl -n -e 'chomp;@F=split(/\t/);$F[13] = substr($F[13],0,255); print join("\t", @F),"\n";'  > S_lycopersicum_May_2012.bed.trimmed
bedToBigBed -tab -type=bed12+2 S_lycopersicum_May_2012.bed.trimmed chrom.sizes S_lycopersicum_May_2012.bb

A BigBed file with more than 256 chromosomes will not render, see issue 37.

Genomic features annotations

Each feature file is rendered as a feature track. The feature track name is the same as the file name.

Feature annotations must be provided as bigBed formatted files.

To convert a gff feature file to bigbed use:

# Download a gff file
wget ftp://ftp.solgenomics.net/tomato_genome/microarrays_mapping/A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.gff
# Sort gff
bedtools sort -i A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.gff > A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.sorted.gff
# Convert gff to bed
gff2bed < A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.sorted.gff > A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bed
# Fetch chrom sizes
twoBitInfo genome.2bit chrom.sizes
# Convert bed to bigbed
bedToBigBed -tab -type=bed6+4 -as=gff3.as A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bed chrom.sizes A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bb

The gff3.as file (available in the Docker container as /app/gff3.as) is used to describe the columns in the bed file.

A BigBed file with more than 256 chromosomes will not render, see issue 37.

Data registration

After deployment the server is running, but contains no data. Data files must be registered so they show up in the web browser.

The following commands expect a running Docker container as described in the Deployment chapter.

Register genome

docker exec ave \
    avedata register \
    --species 'Solanum Lycopersicum' \
    --genome SL.2.40 \
    --datatype 2bit \
    /data/tomato/SL.2.40/genome.2bit

The last argument is the location of the genome 2bit file (/data/tomato/SL.2.40/genome.2bit in this example); it must be an absolute path starting with /data/ and must be readable by anyone inside the Docker container.

The ave-app front end will use this path as the relative HTTP(S) path to fetch the reference genome sequence in the selected region, and the AVE REST service will use it to determine the chromosome list and to build haplotype sequences.
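The mapping from registered path to URL can be illustrated as follows (the host name is a placeholder; the /data volume is hosted at http://&lt;aveserver&gt;/data as described in the Deployment chapter):

```shell
REGISTERED=/data/tomato/SL.2.40/genome.2bit
# The registered absolute path doubles as the HTTP path under the server root:
echo "http://aveserver.example.org${REGISTERED}"
# → http://aveserver.example.org/data/tomato/SL.2.40/genome.2bit
```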

Register variants

docker exec ave \
    avedata register \
    --species 'Solanum Lycopersicum' \
    --genome SL.2.40 \
    --datatype variants \
    /data/tomato/SL.2.40/tomato_snps.bcf

The /data/tomato/SL.2.40/tomato_snps.bcf file must be readable by anyone inside the Docker container.

To perform clustering, a registered genome (2bit) file and a corresponding variants (bcf) file are required.

Register genes

One genome can have one gene track. The gene track shows exons, introns and untranslated regions.

docker exec ave \
    avedata register \
    --species 'Solanum Lycopersicum' \
    --genome SL.2.40 \
    --datatype genes \
    /data/tomato/SL.2.40/gene_models.bb

The last argument is the bigBed formatted file with genes (/data/tomato/SL.2.40/gene_models.bb in this example); it must be an absolute path starting with /data/ and must be readable by anyone inside the Docker container.

Registration can take some time because a Whoosh full text index is built.

The ave-app front end will use this path as the relative http(s) path to fetch the genes in the selected region.

Register feature annotations

One genome can have one or more feature annotation files registered.

docker exec ave \
    avedata register \
    --species 'Solanum Lycopersicum' \
    --genome SL.2.40 \
    --datatype features \
    /data/tomato/SL.2.40/A-AFFY-87.bb

The last argument is the bigBed formatted file with feature annotations (/data/tomato/SL.2.40/A-AFFY-87.bb in this example); it must be an absolute path starting with /data/ and must be readable by anyone inside the Docker container.

Registration can take some time because a Whoosh full text index is built.

The ave-app front end will use this path as the relative http(s) path to fetch the feature annotations in the selected region. The basename of the file, in this case A-AFFY-87, will be used as the track label.

Deregister

To deregister one of the files use the deregister command.

For example to deregister the /data/tomato/SL.2.40/A-AFFY-87.bb file use:

docker exec ave \
    avedata deregister \
    /data/tomato/SL.2.40/A-AFFY-87.bb

Develop

Below are instructions on how to get a development version of the service up and running.

Requirements: git and Anaconda (conda).

Setup

First clone the repository

git clone https://github.com/nlesc-ave/ave-rest-service.git
cd ave-rest-service

Create a new Anaconda environment with all the ave dependencies installed:

conda env create -f environment.yml

On macOS use environment.osx.yml instead of environment.yml.

Activate the environment

source activate ave2

Install ave for development with

python setup.py develop

If dependencies are changed in environment.yml, update the conda environment by running

conda env update -f environment.yml

Configuration

The service needs a configuration file called settings.cfg.

The repo contains an example config file called settings.example.cfg. Copy the example config file to settings.cfg and edit it.

Make sure the WHOOSH_BASE_DIR directory exists.

Run service

Change to the directory containing the settings.cfg file, then run:

gunicorn -w 4 --threads 2 -t 60 -b 127.0.0.1:8080 avedata.app:app

It will run the service on http://127.0.0.1:8080/. The API endpoint is at http://127.0.0.1:8080/api/. The Swagger UI is at http://127.0.0.1:8080/api/ui.

This only runs the Python web service; the hosting of the application and data files is explained in the Deployment chapter.

Run commands

The avedata command line tool expects to be run from the directory containing the settings.cfg file.

Available commands:

  • deregister Remove a filename or url from the database
  • drop_db Drops database
  • init_db Initializes database
  • list List of registered files/urls in the database
  • register Add file metadata information to the database
  • run Run as single threaded web service

Build Docker image

The Docker image of the latest commit of the master branch is automatically built on Docker Hub with the latest tag.

If you have local changes and want to test the Docker container locally then you can build the Docker image with

docker build -t ave2/allelic-variation-explorer .

The Docker image contains the latest version of ave-app. If you want to run a different version of the app, replace the curl/tar command in ./Dockerfile with commands that put the app you want in the /var/www/html directory, and build a Docker image.
