Sequence variation data of the human proteome can be used to analyze 3-dimensional (3D) protein structures to derive functional insights. This program maps observed missense variation data to 3D structures, uses an underlying genetic model to estimate the expected variation, and produces 3D Tolerance Scores (3DTS) for genetic loci in 3D protein space. The model is implemented in this code and described in detail in the manuscript listed below (see Citation).
- Launch an r4.4xlarge EC2 instance (Amazon Linux AMI) with a 600GB EBS volume.
- Download install.then.run.sh and run it from the EBS volume. Note that EC2/FTP issues may require manual download of files from GenCode (see the comments in install.then.run.sh).
./install.then.run.sh > log
After the script finishes (~10 hours), 3DTS scores will have been produced using:
- Genomic annotations from GenCode.v26lift37
- Genomic variation using gnomAD exomes and genomes data
- Protein structure information from the PDB
- Feature information from the current release of UniprotKB
The most relevant outputs will be the 3DTS scores and 3DTS feature definitions (filenames may vary slightly):
3DTS scores: 3DTS/data/depletion3-7/full.gencode.v26lift37.annotation.gtf.gz.genome.json.gz.variationdata.json.gz.5.0.-52800447..json.gz.gencode.v26lift37.annotation.gtf.gz.genome.json.gz.-1067519786.json.gz
3DTS feature definitions: 3DTS/data/structuralcontextfromfeatures/5.0.-52800447..json.gz
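Both outputs are gzip-compressed JSON. The README does not describe their internal layout (one JSON document versus one record per line), so a cautious first step is to look at the first few hundred characters before writing a parser. The helper below is a minimal sketch for that; the function name `peek_gz` is hypothetical and not part of the 3DTS code.

```python
import gzip


def peek_gz(path, n_chars=500):
    """Return the first n_chars characters of a gzip-compressed text file.

    Reading a fixed number of characters avoids any assumption about
    whether the file is line-delimited or a single large JSON document.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read(n_chars)


# Example (path taken from the README; actual filenames may vary slightly):
# print(peek_gz("3DTS/data/structuralcontextfromfeatures/5.0.-52800447..json.gz"))
```

Once the layout is known, the standard `json` module can be used to parse either the whole document or individual lines.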
For interactive queries, a server on port 8080 is launched after completion of the script, allowing users to query and visualize specific structures of interest. The color scheme ranges from red (intolerant) to white to blue (tolerant). Structures are displayed only for PDB-based queries; for any other query type, 3DTS information is shown in a table below. Currently only proteinogenic atoms are displayed.
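To illustrate the red-white-blue scheme, here is a minimal sketch of a diverging color map. It assumes scores normalized to the range [0, 1], with 0 the most intolerant and 1 the most tolerant, and uses linear interpolation. Both the normalization and the interpolation are assumptions made for illustration; the server's actual mapping is not described here, and `score_to_rgb` is a hypothetical name.

```python
def score_to_rgb(score):
    """Map a score in [0, 1] to an (r, g, b) tuple.

    0.0 -> red (intolerant), 0.5 -> white, 1.0 -> blue (tolerant).
    Values outside [0, 1] are clamped to the nearest end of the scale.
    """
    s = min(max(float(score), 0.0), 1.0)
    if s <= 0.5:
        t = s / 0.5  # 0 at red, 1 at white
        level = round(255 * t)
        return (255, level, level)
    t = (s - 0.5) / 0.5  # 0 at white, 1 at blue
    level = round(255 * (1 - t))
    return (level, level, 255)
```

A diverging scale with white in the middle keeps the two extremes visually distinct while de-emphasizing intermediate positions.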
The code has been optimized for producing structural proteome-wide scores. Individual proteins can then be queried through the output files or via the web server on port 8080 (see Output section above).
A shell script has been added to this repo to allow for individual queries using a UniprotKB ID with known X-ray structures by editing the variable uniprotofinterest (see single_protein_query.sh). The tested query takes ~1 hour.
./single_protein_query.sh > log
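When running several single-protein queries, editing the variable by hand can be scripted. The sketch below rewrites a shell variable assignment in the text of a script. It assumes the variable is assigned on its own line as `uniprotofinterest=...`, which has not been checked against single_protein_query.sh; the function name and the accessions in the example are illustrative only.

```python
import re


def set_shell_variable(script_text, name, value):
    """Return script_text with the assignment `name=...` replaced by `name=value`.

    Assumes a plain one-line assignment (optionally prefixed by `export`).
    The value is restricted to letters, digits, and underscores so that
    no shell quoting is needed.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", value):
        raise ValueError("unexpected characters in value: %r" % value)
    pattern = re.compile(
        r"^([ \t]*(?:export[ \t]+)?" + re.escape(name) + r"=).*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + value, script_text)
    if count == 0:
        raise ValueError("no assignment to %r found" % name)
    return new_text


# Example with placeholder accessions:
# text = open("single_protein_query.sh").read()
# open("single_protein_query.sh", "w").write(
#     set_shell_variable(text, "uniprotofinterest", "P01112"))
```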
- Copy the jdistlib distribution jar to lib/ from http://jdistlib.sourceforge.net/
- Install sbt (e.g. `brew install sbt`)
- Publish to the local ivy repository all projects in the `dependencies/` folder:
for d in dependencies/*/; do (cd "$d" && sbt publishLocal); done
- From the root folder:
sbt packageZipTarball
- Move the packaged application located in `target/universal/saturation-0.1-SNAPSHOT.tgz` to a machine with 120 GB RAM and ~600 GB disk (e.g., EC2 instance type r4.4xlarge)
- Edit the configuration file (see Configuration section below)
- Unpack the packaged application and run it with:
bin/saturation -Dconfig.file=path/to/config -J-Xmx115G -Djava.io.tmpdir=path/to/tmp
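Because a full run takes many hours, it can help to confirm that the target machine meets the stated requirements before launching. The sketch below compares detected RAM and disk size with the figures given above (120 GB RAM, ~600 GB disk), read here as decimal gigabytes. The thresholds, function names, and the use of total rather than free disk space are assumptions and should be adjusted to the actual setup.

```python
import os
import shutil

GB = 10 ** 9
MIN_RAM_BYTES = 120 * GB   # from the README: 120 GB RAM
MIN_DISK_BYTES = 600 * GB  # from the README: ~600 GB disk (approximate)


def resource_problems(total_ram_bytes, total_disk_bytes,
                      min_ram=MIN_RAM_BYTES, min_disk=MIN_DISK_BYTES):
    """Return a list of human-readable problems; empty if both checks pass."""
    problems = []
    if total_ram_bytes < min_ram:
        problems.append("RAM: %.0f GB < %.0f GB"
                        % (total_ram_bytes / GB, min_ram / GB))
    if total_disk_bytes < min_disk:
        problems.append("disk: %.0f GB < %.0f GB"
                        % (total_disk_bytes / GB, min_disk / GB))
    return problems


def detect_total_ram_bytes():
    """Total physical memory in bytes (uses POSIX sysconf; tested idea for Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


# Example, run from the directory that will hold the results:
# print(resource_problems(detect_total_ram_bytes(), shutil.disk_usage(".").total))
```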
akka.http.client.parsing.max-content-length = infinite
akka.http.host-connection-pool.client.idle-timeout = infinite
# s3 or filesystem path where result and intermediate files will be written
tasks.fileservice.storageURI =
hosts.RAM=120000
hosts.numCPU=16 # or whatever is convenient
The values of the following keys should be http, https, or s3 URLs (s3 recommended).
uniprotKb = uniprot-all.txt.gz // Compressed Text formatted file for SwissProt (Reviewed) portion of Uniprot for Proteome UP000005640
gencodeGTF = gencode.v26lift37.annotation.gtf.gz // Comprehensive gene annotation for GRCh37
gencodeTranscripts = gencode.v26lift37.pc_transcripts.fa.gz // Protein-coding transcript sequences for GRCh37
gencodeMetadataXrefUniprot = gencode.v26lift37.metadata.SwissProt.gz // Cross-reference between GRCh37 and SwissProt
gnomadGenome = gnomad.genomes.r2.0.1.sites.coding.autosomes.vcf.gz // gnomAD Genome variants
gnomadExome = gnomad.exomes.r2.0.1.sites.vcf.gz // gnomAD Exome variants
# genome coverage files from the gnomAD browser should be concatenated
# and the header lines removed
# e.g. `for i in $(tar tf genome.coverage.all.tar | grep -v tbi ) ; do tar xOf genome.coverage.all.tar $i ; done | gunzip -c | grep -v '#' > genome.coverage.concat.txt`
gnomadExomeCoverage = exome.coverage.concat.txt // gnomAD Exome Coverage
gnomadGenomeCoverage = genome.coverage.concat.txt // gnomAD Genome Coverage
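The coverage concatenation described in the comment above can also be done without a shell. The following is a minimal Python sketch that mirrors the one-liner: it skips archive members whose names contain `tbi`, decompresses the rest, and drops every line containing `#`. The function name `concat_coverage` is hypothetical, and the sketch assumes every remaining member of the tar archive is a gzip-compressed text file.

```python
import gzip
import tarfile


def concat_coverage(tar_path, out_path):
    """Concatenate gzip-compressed members of a tar archive into one text file.

    Mirrors the shell one-liner: index files (names containing 'tbi') are
    skipped and lines containing '#' (headers) are removed.
    """
    with tarfile.open(tar_path) as archive, \
            open(out_path, "w", encoding="utf-8") as out:
        for member in archive:
            if not member.isfile() or "tbi" in member.name:
                continue
            raw = archive.extractfile(member)
            with gzip.open(raw, "rt", encoding="utf-8") as text:
                for line in text:
                    if "#" not in line:
                        out.write(line)


# Example:
# concat_coverage("genome.coverage.all.tar", "genome.coverage.concat.txt")
```

Members are processed one at a time and line by line, so memory use stays small even for large coverage files.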
The configuration file uses the Typesafe Config (HOCON) format; see https://github.com/typesafehub/config
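Since the input keys are expected to hold http, https, or s3 URLs, a quick check before a long run can catch bare filenames or other schemes. The sketch below takes the input keys as a plain dictionary and returns the entries that do not match. It does not parse the configuration file itself; the function name and the bucket name in the example are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "s3"}


def invalid_input_urls(inputs):
    """Return the subset of `inputs` whose value is not an http, https, or s3 URL."""
    return {
        key: value
        for key, value in inputs.items()
        if urlparse(value).scheme not in ALLOWED_SCHEMES
    }


# Example with a hypothetical bucket name:
# invalid_input_urls({
#     "gencodeGTF": "s3://my-bucket/gencode.v26lift37.annotation.gtf.gz",
#     "uniprotKb": "uniprot-all.txt.gz",  # bare filename, reported as invalid
# })
```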
Manuscript under consideration. Submission available on bioRxiv: http://www.biorxiv.org/content/early/2017/08/29/182287
The 3DTS Software Code (the "Code") is made available by Human Longevity, Inc. ("HLI") on a non-exclusive, non-sublicensable, non-transferable basis solely for non-commercial academic research use. Commercial use of the Code is expressly prohibited. If you would like to obtain a license to the Code for commercial use, please contact HLI at bizdev@humanlongevity.com. HLI MAKES NO REPRESENTATIONS OR WARRANTIES WHATSOEVER, EITHER EXPRESS OR IMPLIED, WITH RESPECT TO THE CODE PROVIDED HEREUNDER. IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO CODE ARE EXPRESSLY DISCLAIMED. THE CODE IS FURNISHED "AS IS" AND "WITH ALL FAULTS" AND DOWNLOADING OR USING THE CODE IS UNDERTAKEN AT YOUR OWN RISK. TO THE FULLEST EXTENT ALLOWED BY APPLICABLE LAW, IN NO EVENT SHALL HLI BE LIABLE, WHETHER IN CONTRACT, TORT, WARRANTY, OR UNDER ANY STATUTE OR ON ANY OTHER BASIS FOR SPECIAL, INCIDENTAL, INDIRECT, PUNITIVE, MULTIPLE OR CONSEQUENTIAL DAMAGES SUSTAINED BY YOU OR ANY OTHER PERSON OR ENTITY ON ACCOUNT OF USE OR POSSESSION OF THE CODE, WHETHER OR NOT FORESEEABLE AND WHETHER OR NOT HLI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES, INCLUDING WITHOUT LIMITATION DAMAGES ARISING FROM OR RELATED TO LOSS OF USE, LOSS OF DATA, DOWNTIME, OR FOR LOSS OF REVENUE, PROFITS, GOODWILL, BUSINESS OR OTHER FINANCIAL LOSS.