adding cctop eff score
maximilianh committed Oct 11, 2018
1 parent f72b1fa commit 92fceb0
Showing 8 changed files with 937 additions and 6 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -30,7 +30,7 @@ For the Cpf1 scoring model:

sudo pip install keras tensorflow h5py

-Install required R libraries:
+Install required R libraries for the WangSVM efficiency score:

sudo Rscript -e 'install.packages(c("e1071"), repos="http://cran.rstudio.com/")'
sudo Rscript -e 'source("https://bioconductor.org/biocLite.R"); biocLite(c("limma"));'
712 changes: 712 additions & 0 deletions bin/src/cctop_standalone/CCTop.py

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions bin/src/cctop_standalone/LICENSE
@@ -0,0 +1,22 @@

The MIT License (MIT)

Copyright (c) 2015 Juan L. Mateo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
77 changes: 77 additions & 0 deletions bin/src/cctop_standalone/README
@@ -0,0 +1,77 @@
CCTop is a tool to determine suitable CRISPR/Cas9 target sites in one or more
query sequences and to predict their potential off-target sites. The online
version of CCTop is available at http://crispr.cos.uni-heidelberg.de/

This is a command line version of CCTop, designed mainly to allow searching
large volumes of sequences and to offer higher flexibility.

If you use this tool for your scientific work, please cite it as:
Stemmer, M., Thumberger, T., del Sol Keyer, M., Wittbrodt, J. and Mateo, J.L.
CCTop: an intuitive, flexible and reliable CRISPR/Cas9 target prediction tool.
PLOS ONE (2015). doi:10.1371/journal.pone.0124633


REQUIREMENTS

CCTop is implemented in Python and requires version 2.7 or above.

In addition, we rely on the short read aligner Bowtie 1 to identify the
off-target sites. Bowtie can be downloaded in binary format for the main
platforms from http://bowtie-bio.sourceforge.net/index.shtml
You need to create an indexed version of the genome sequence of your
target species. This can be done with the tool bowtie-build included in the
Bowtie installation. For that you simply need a fasta file containing the genome
sequence. To get the index you can do something like:
$ <path-to-bowtie-folder>/bowtie-build -r -f <your-fasta-file> <index-name>

The previous line will create the index files in the current folder.
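
The index-building step above can also be scripted. This is a minimal sketch, assuming Python 3; the helper names `bowtie_build_command` and `build_index` and the example paths are hypothetical, and only the `-r -f` flags and the argument order come from the command shown above.

```python
import os
import subprocess


def bowtie_build_command(bowtie_dir, fasta_file, index_name):
    """Return the argument list for the bowtie-build call shown above."""
    executable = os.path.join(bowtie_dir, "bowtie-build")
    return [executable, "-r", "-f", fasta_file, index_name]


def build_index(bowtie_dir, fasta_file, index_name):
    """Run bowtie-build and raise CalledProcessError if it fails."""
    subprocess.check_call(bowtie_build_command(bowtie_dir, fasta_file, index_name))


if __name__ == "__main__":
    # Hypothetical paths; the command is printed here rather than executed.
    print(" ".join(bowtie_build_command("/opt/bowtie", "genome.fa", "myGenome")))
```

Building the argument list separately from running it makes it easy to inspect the command before starting what can be a long indexing job.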

To handle gene and exon annotations we use the python library bx-python
(https://bitbucket.org/james_taylor/bx-python/). This library is only required
if you want to associate off-target sites with the closest exon/gene, otherwise
you don't need to install it. Notice, however, that in this case all candidate
target sites will be given the same score, because in the current version the
score of the candidate target sites considers only off-target sites with
associated exon/gene.

The exon and gene files contain basically the coordinates of those elements in
bed format (http://genome.ucsc.edu/FAQ/FAQformat#format1), which are the first
three columns of the file. There are two more columns with the ID and name of
the corresponding gene and a sixth empty column to comply with the format
accepted by the library. You can easily generate such files for your
target organism using Ensembl Biomart (http://www.ensembl.org/biomart).

In case of difficulties with these files, contact us and we can provide the
ones you need or help you to generate your own.
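
The six-column layout described above can be illustrated with a short sketch. The function names and the example gene ID and name are hypothetical; the column order (three BED coordinates, gene ID, gene name, empty sixth column) follows the description in this README.

```python
def format_annotation_line(chrom, start, end, gene_id, gene_name):
    """Build one tab-separated line: 3 BED coordinates, ID, name, empty 6th column."""
    fields = [chrom, str(start), str(end), gene_id, gene_name, ""]
    return "\t".join(fields)


def parse_annotation_line(line):
    """Split a line back into (chrom, start, end, gene_id, gene_name)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[3], fields[4]


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    line = format_annotation_line("chr1", 1000, 2000, "GENE0001", "geneA")
    print(repr(line))
    print(parse_annotation_line(line))
```

The trailing empty field is what produces the sixth column: the line ends with a tab followed by nothing.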


INSTALLATION

Simply download the two .py files (CCTop.py and bedInterval.py) to a folder of
your choice and follow the instructions that you can find in the respective web
sites to install Bowtie 1 and bx-python.


USAGE

You can run CCTop with the -h flag to get a detailed list of the available
parameters. For instance:
$ python <download-folder>/CCTop.py -h<Enter>

At minimum it is necessary to specify the input (multi)fasta file (--input) and
the index file (--index). In this case CCTop assumes that the Bowtie executable
can be found in the current folder, that there are no gene and exon files to
use, and that the remaining parameters take their default values. Notice that
the index file refers to the name of the index you specified for bowtie-build,
together with the path if necessary. A command for a typical run will look
something like this:
$ python <download-folder>/CCTop.py --input <query.fasta> --index <path/index-name> --bowtie <path-to-bowtie> --output <output-folder> <Enter>
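
For batch use, the same command line can be assembled programmatically. A minimal sketch, assuming Python 3; the helper name `cctop_command` and the example paths are hypothetical, while the four flags (`--input`, `--index`, `--bowtie`, `--output`) are the ones named in this README.

```python
def cctop_command(cctop_py, query_fasta, index, bowtie_dir=None, output_dir=None):
    """Return the argument list for a CCTop run.

    Only --input and --index are mandatory; the other flags are added
    only when a value is given, so CCTop falls back to its defaults.
    """
    cmd = ["python", cctop_py, "--input", query_fasta, "--index", index]
    if bowtie_dir is not None:
        cmd += ["--bowtie", bowtie_dir]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; pass the list to subprocess.check_call to run it.
    print(" ".join(cctop_command("CCTop.py", "query.fasta", "idx/myGenome",
                                 bowtie_dir="/opt/bowtie", output_dir="out")))
```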

The result of the run will be three files for each sequence in the input query
file. These files will have extension .fasta, .bed and .xls, containing,
respectively, the sequence of the target sites, their coordinates and their
detailed information as in the online version of CCTop. The name of the output
file(s) will be taken from the name of the sequences in the input fasta file.
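
To anticipate which files a run will produce, the sequence names can be read from the query file. This sketch assumes that the output base name is the first word of each FASTA header; the README only states that names are taken from the sequence names, so the exact rule is an assumption, and both function names are hypothetical.

```python
def sequence_names(fasta_text):
    """Return the first word of every '>' header line in a FASTA string."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            names.append(header.split()[0] if header else "")
    return names


def expected_outputs(fasta_text, extensions=(".fasta", ".bed", ".xls")):
    """Map each sequence name to the three file names expected for it."""
    return {name: [name + ext for ext in extensions]
            for name in sequence_names(fasta_text)}


if __name__ == "__main__":
    example = ">site1 some description\nACGTACGT\n>site2\nGGCCGGCC\n"
    for name, files in expected_outputs(example).items():
        print(name, files)
```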


86 changes: 86 additions & 0 deletions bin/src/cctop_standalone/bedInterval.py
@@ -0,0 +1,86 @@
'''
Created on Mar 20, 2014
@author: juan
'''


class MyInterval(object):
    def __init__(self,start,end,value):
        self.start = start
        self.end = end
        self.value = value

class BedInterval( object ):
    '''
    classdocs
    '''


    def __init__(self):
        '''
        Constructor
        '''
        self.chroms = {}

    def insert( self, chrom, start, end, gene_id, gene_name ):
        from bx.intervals.intersection import Interval
        from bx.intervals.intersection import IntervalNode
        from bx.intervals.intersection import IntervalTree
        if chrom in self.chroms:
            self.chroms[chrom].insert( start, end, MyInterval(start,end,[gene_id, gene_name]) )
        else:
            self.chroms[chrom] = IntervalTree()
            self.chroms[chrom].insert( start, end, MyInterval(start,end,[gene_id, gene_name]) )

    def loadFile(self, file):
        bedFile = open(file,'r')
        for line in bedFile:
            line = line.rstrip('\n').split('\t')
            self.insert(line[0], int(line[1]), int(line[2]), line[3], line[4])

    def overlaps(self, chrom, start, end):
        if not chrom in self.chroms:
            return False
        overlapping = self.chroms[chrom].find(start, end)
        if len(overlapping)>0:
            return True
        else:
            return False

    def closest(self,chrom, start, end):
        if not chrom in self.chroms:
            return ['NA', 'NA', 'NA']

        #first checking if this site overlaps any from the loaded file
        overlapping = self.chroms[chrom].find(start, end)
        if len(overlapping)>0:
            return [overlapping[0].value[0],overlapping[0].value[1],0]

        #to the left
        #it finds features with a start > than 'position'
        left = self.chroms[chrom].before(start, max_dist=1e5)

        #to the right
        #it finds features with a end < than 'position'
        right = self.chroms[chrom].after(end, max_dist=1e5)

        if len(left)>0:
            if len(right)>0:
                distLeft = max(0,1 + start - left[0].end)
                distRight = max(0,1 + right[0].start - end)
                if distLeft < distRight:
                    return [left[0].value[0],left[0].value[1],distLeft]
                else:
                    return [right[0].value[0],right[0].value[1],distRight]
            else:
                distLeft = max(0,1 + start - left[0].end)
                return [left[0].value[0],left[0].value[1],distLeft]
        else:
            if len(right)>0:
                distRight = max(0,1 + right[0].start - end)
                return [right[0].value[0],right[0].value[1],distRight]
            else:
                return ['NA', 'NA', 'NA']
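
The lookup order in `closest` (overlap first, then nearest neighbour on either side) can be tried without bx-python. The following is a simplified sketch over a plain list, not the committed implementation: the function name `closest_feature` is hypothetical, half-open coordinates are assumed for the overlap test, and the tie-breaking and `max_dist` boundary behaviour of bx-python's `IntervalTree` are not reproduced. Only the distance formula `max(0, 1 + ...)` is taken from the code above.

```python
def closest_feature(features, start, end, max_dist=100000):
    """Return [gene_id, gene_name, distance] for the feature nearest to (start, end).

    features: list of (feat_start, feat_end, gene_id, gene_name) tuples,
    all on the same chromosome.
    """
    # An overlapping feature wins immediately, with distance 0.
    for f_start, f_end, gene_id, gene_name in features:
        if f_start < end and f_end > start:
            return [gene_id, gene_name, 0]

    best = None
    for f_start, f_end, gene_id, gene_name in features:
        if f_end <= start:
            # Feature lies entirely to the left of the query.
            dist = max(0, 1 + start - f_end)
        else:
            # No overlap was found, so the feature lies to the right.
            dist = max(0, 1 + f_start - end)
        if dist <= max_dist and (best is None or dist < best[2]):
            best = [gene_id, gene_name, dist]
    return best if best is not None else ["NA", "NA", "NA"]


if __name__ == "__main__":
    genes = [(100, 200, "g1", "A"), (500, 600, "g2", "B")]
    print(closest_feature(genes, 210, 230))
```

A linear scan is only practical for small annotation sets; the interval tree in the committed code serves the same purpose for genome-scale files.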


3 changes: 2 additions & 1 deletion crispor.py
@@ -345,6 +345,7 @@
"fusi" : ("Doench '16", "Aka the 'Fusi-Score', since V4.4 using the version 'Azimuth', scores are slightly different than before April 2018 but very similar (click 'show all' to see the old scores). Range: 0-100. Boosted Regression Tree model, trained on data produced by Doench et al (881 guides, MOLM13/NB4/TF1 cells + unpublished additional data). Delivery: lentivirus. See <a target='_blank' href='http://biorxiv.org/content/early/2015/06/26/021568'>Fusi et al. 2015</a> and <a target='_blank' href='http://www.nature.com/nbt/journal/v34/n2/full/nbt.3437.html'>Doench et al. 2016</a> and <a target=_blank href='https://crispr.ml/'>crispr.ml</a>. Recommended for guides expressed in cells (U6 promoter). Click to sort the table by this score."),
"fusiOld" : ("Doench '16-Old", "The original implementation of the Doench 2016 score, as received from John Doench. The scores are similar, but not exactly identical to the 'Azimuth' version of the Doench 2016 model that is currently the default on this site, since Apr 2018."),
"najm" : ("Najm 2018", "A modified version of the Doench 2016 score ('Azimuth'), by Mudra Hegde for S. aureus Cas9. Range 0-100. See <a target=_blank href='https://www.nature.com/articles/nbt.4048'>Najm et al 2018</a>."),
"ccTop" : ("CCTop", "The efficiency score used by CCTop, called 'crisprRank'."),
"aziInVitro" : ("Azimuth in-vitro", "The Doench 2016 model trained on the Moreno-Mateos zebrafish data. Unpublished model, gratefully provided by J. Listgarden"),
"housden" : ("Housden", "Range: ~ 1-10. Weight matrix model trained on data from Drosophila mRNA injections. See <a target='_blank' href='http://stke.sciencemag.org/content/8/393/rs9.long'>Housden et al.</a>"),
"proxGc" : ("ProxGCCount", "Number of GCs in the last 4pb before the PAM"),
@@ -1391,7 +1392,7 @@ def makePosList(org, countDict, guideSeq, pam, inputPos):
guideNoPam = "A"+guideNoPam

if pamIsCpf1(pam):
-    # Cpf1 has no scores yet
+    # Cpf1 has no off-target scores yet
    mitScore=0.0
    cfdScore=0.0
else:
17 changes: 16 additions & 1 deletion crisporEffScores.py
@@ -8,7 +8,9 @@
# - Fusi: Fusi et al, prepublication manuscript on bioarxiv, http://dx.doi.org/10.1101/021568 http://research.microsoft.com/en-us/projects/azimuth/, only available as a web API
# - Housden: Housden et al, PMID 26350902, http://www.flyrnai.org/evaluateCrispr/
# - OOF: Microhomology and out-of-frame score from Bae et al, Nat Biotech 2014 , PMID24972169 http://www.rgenome.net/mich-calculator/
-# - Wu-Crispr: Wong et al, PMID, http://www.genomebiology.com/2015/16/1/218
+# - Wu-Crispr: Wong et al, http://www.genomebiology.com/2015/16/1/218
# - DeepCpf1, Kim et al, PMID 29431740, https://www.ncbi.nlm.nih.gov/pubmed/29431740
# - SaCas9 efficiency score (no name), Najm et al, https://www.ncbi.nlm.nih.gov/pubmed/29251726

# the input are 100bp sequences that flank the basepair just 5' of the PAM +/-50bp.
# so 50bp 5' of the PAM, and 47bp 3' of the PAM -> 100bp
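
Read literally, the comment above implies a layout of 50 bp, then a 3 bp PAM, then 47 bp. The sketch below shows how offsets such as `(-20, 0)` or `(-24, 6)` would select a window under that reading. It assumes the PAM starts at 0-based index 50; the helper `trim_around_pam` is hypothetical, and the real `trimSeqs` is not shown in this diff and may differ.

```python
PAM_START = 50  # assumed 0-based index of the first PAM base in the 100 bp context


def trim_around_pam(seq, five_prime, three_prime):
    """Return the window from PAM_START+five_prime to PAM_START+three_prime, upper-cased.

    Negative offsets reach 5' of the PAM, positive offsets reach into
    and beyond the PAM.
    """
    return seq[PAM_START + five_prime:PAM_START + three_prime].upper()


if __name__ == "__main__":
    # Synthetic 100 bp context: 30 filler bases, a 20 bp guide, PAM, 47 filler bases.
    context = "a" * 30 + "c" * 20 + "tgg" + "t" * 47
    print(trim_around_pam(context, -20, 0))   # the 20 bp guide
    print(trim_around_pam(context, 0, 3))     # the PAM
    print(len(trim_around_pam(context, -24, 6)))
```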
@@ -36,6 +38,9 @@
najm2018Dir = join(dirname(__file__), "bin/najm2018/")
sys.path.insert(0, najm2018Dir)

cctopDir = join(dirname(__file__), "bin/src/cctop_standalone")
sys.path.insert(0, cctopDir)

# import numpy as np

# global that points to the crispor 'bin' directory with the external executables
@@ -822,6 +827,8 @@ def calcAllScores(seqs, addOpt=[], doAll=False, skipScores=[], enzyme=None):
logging.debug("Azimuth in-vitro")
scores["aziInVitro"] = calcAziInVitro(trimSeqs(seqs, -24, 6))

scores["ccTop"] = calcCctopScore(trimSeqs(seqs, -20, 0))

scores["finalGc6"] = [int(s.count("G")+s.count("C") >= 4) for s in trimSeqs(seqs, -6, 0)]
scores["finalGg"] = [int(s=="GG") for s in trimSeqs(seqs, -2, 0)]
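
The two binary features above can be checked on a single guide. A minimal sketch, assuming the input is the upper-case protospacer ending immediately 5' of the PAM; the function names `final_gc6` and `final_gg` are hypothetical and simply mirror the list comprehensions above.

```python
def final_gc6(guide):
    """Return 1 if at least 4 of the last 6 guide bases are G or C, else 0."""
    last6 = guide[-6:]
    return int(last6.count("G") + last6.count("C") >= 4)


def final_gg(guide):
    """Return 1 if the last two guide bases are GG, else 0."""
    return int(guide[-2:] == "GG")


if __name__ == "__main__":
    # Made-up 20 bp guides, for illustration only.
    for guide in ("ATATATATATATATGCGCGG", "GCGCGCGCGCGCGCATATGC"):
        print(guide, final_gc6(guide), final_gg(guide))
```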

@@ -1110,6 +1117,14 @@ def calcDeepCpf1Scores(seqs):
import DeepCpf1
return DeepCpf1.scoreSeqs(seqs)

def calcCctopScore(seqs):
    import CCTop
    scores = []
    for seq in seqs:
        score = CCTop.getScore(seq)
        scores.append(score)
    return scores

# ----------- MAIN --------------
if __name__=="__main__":
args, options = parseArgs()
24 changes: 21 additions & 3 deletions todo.txt
@@ -21,8 +21,6 @@ Add these new criteria:
- Jennifer's zebrafish model
- CRISPROff and CRISPRSpec? http://rth.dk/resources/crispr/

Show distance of off-targets from the on-target!

Show a histogram of the mismatches instead of 0-0-10-15-40

What is site-seq and circle-seq?
@@ -55,9 +53,25 @@ off-target searchers:
- align against ALL human genomes: https://github.com/emptyewer/MAPster
- Church lab (CRISPR-GA): http://54.80.152.219/index.php

- more scores:
- on-target scores:
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1697-6
Says that dinucleotides improve the AUC, code at:
http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna
Good R code, easy to run
- CrisprPred http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0181943
code at: https://github.com/khaled-rahman/CRISPRpred
Weird R code, nothing easy to calculate the score, looks like CS people
No updates ever
- CCTop crisprRater https://academic.oup.com/nar/article/46/3/1375/4754467

off-targets:
- OfftargetPredict https://academic.oup.com/bioinformatics/article/34/17/i757/5093213
code at https://github.com/penn-hui/OfftargetPredict
- crisproff, unpublished but on github


induced deletions:
- see JP email, CSHL 2018

---

@@ -164,3 +178,7 @@ of good guides in the first and second exon of a gene, for creating knockouts.
Thanks again, this will be super helpful for finding guides for our next few
projects
Zach

Done:

Show distance of off-targets from the on-target!
