- Python 3.7 and standard packages (pickle, scipy, numpy, pandas)
The models were built with pytorch==1.1.0 and torchvision==0.2.2.
Clone this GitHub repository, then set up your environment to import the
predict.py script in whichever way is most convenient for you. In Python, for instance, you may use the following at the top of your script to import the model.
import sys
sys.path.append('/directory/containing/local/repo/clone/')
from be_predict_bystander import predict as bystander_model
from be_predict_bystander import predict as bystander_model
bystander_model.init_model(base_editor='BE4', celltype='mES')
Note: Supported cell types are ['mES', 'HEK293'] and supported base editors are ['ABE', 'ABE-CP1040', 'BE4', 'BE4-CP1028', 'AID', 'CDA', 'eA3A', 'evoAPOBEC', 'eA3A-T44DS45A', 'BE4-H47ES48A', 'eA3A-T31A', 'eA3A-T31AT44A']. Not all combinations of base editors and cell types are supported -- refer to
Available C-to-G base editors (CGBEs):
['CG-eA3A', 'CG-689', 'CG-APOBEC1', 'CG-POLD2-APOBEC1-X', 'CG-RBMX-eA3A-X-HF', 'CG-RBMX-eA3A-X', 'CG-X-689-X-RBMX', 'CG-X-APOBEC1-X-HF', 'CG-X-EE-X-X', 'CG-eA3A-dead', 'CG-EE'].
If your cell type of interest is not included here, we recommend using mES. Major base editing outcomes are fairly consistent across cell types, though rarer outcomes, including cytosine transversions, are known to depend on cell type to some extent.
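Because an unsupported name is easy to mistype, it can help to check the arguments before calling init_model. The sketch below is a local convenience, not part of the package: `check_names` is a hypothetical helper, and it assumes that the CGBE names are passed through the same `base_editor` argument as the other editors. It only checks spelling against the lists above; it cannot tell whether a particular editor and cell type combination has a trained model.

```python
# Names copied from the lists in this README. This helper is illustrative
# and is not provided by be_predict_bystander.
SUPPORTED_CELLTYPES = ['mES', 'HEK293']
SUPPORTED_EDITORS = [
    'ABE', 'ABE-CP1040', 'BE4', 'BE4-CP1028', 'AID', 'CDA', 'eA3A',
    'evoAPOBEC', 'eA3A-T44DS45A', 'BE4-H47ES48A', 'eA3A-T31A',
    'eA3A-T31AT44A',
]
SUPPORTED_CGBES = [
    'CG-eA3A', 'CG-689', 'CG-APOBEC1', 'CG-POLD2-APOBEC1-X',
    'CG-RBMX-eA3A-X-HF', 'CG-RBMX-eA3A-X', 'CG-X-689-X-RBMX',
    'CG-X-APOBEC1-X-HF', 'CG-X-EE-X-X', 'CG-eA3A-dead', 'CG-EE',
]


def check_names(base_editor, celltype):
    """Raise ValueError if either name is missing from the README lists."""
    if celltype not in SUPPORTED_CELLTYPES:
        raise ValueError(
            f"Unknown cell type {celltype!r}; expected one of {SUPPORTED_CELLTYPES}"
        )
    # Assumption: CGBE names are accepted as base_editor values.
    if base_editor not in SUPPORTED_EDITORS + SUPPORTED_CGBES:
        raise ValueError(f"Unknown base editor {base_editor!r}")
    return base_editor, celltype


base_editor, celltype = check_names('BE4', 'mES')
# The validated names can then be passed on, for example:
# bystander_model.init_model(base_editor=base_editor, celltype=celltype)
```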
pred_df, stats = bystander_model.predict(seq)
seq is a 50-nt string of DNA characters, spanning from positions -19 to 30 where positions 1-20 are the spacer, an NGG PAM occurs at positions 21-23, and position 0 is used to refer to the position directly upstream of position 1.
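The position convention maps onto Python string indices by adding 19, so position -19 is index 0 and position 30 is index 49. The following sketch shows one way to check an input and pull out the spacer and PAM before calling predict. `position_to_index` and `split_target` are hypothetical helper names, and restricting the alphabet to A, C, G and T is an assumption about what "DNA characters" means here.

```python
def position_to_index(position):
    """Map a position in the -19..30 convention to a 0-based string index."""
    if not -19 <= position <= 30:
        raise ValueError(f"position {position} is outside -19..30")
    return position + 19


def split_target(seq):
    """Return (spacer, pam) for a 50-nt target sequence."""
    if len(seq) != 50:
        raise ValueError(f"expected 50 nt, got {len(seq)}")
    seq = seq.upper()
    # Assumption: only unambiguous bases are allowed.
    unexpected = set(seq) - set('ACGT')
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # Positions 1-20 are the spacer, positions 21-23 are the PAM.
    spacer = seq[position_to_index(1):position_to_index(20) + 1]
    pam = seq[position_to_index(21):position_to_index(23) + 1]
    return spacer, pam


seq = 'TATCAGCGGGAATTCAAGCGCACCAGCCAGAGGTGTACCGTGGACGTGAG'
spacer, pam = split_target(seq)
print(spacer, pam)  # CACCAGCCAGAGGTGTACCG TGG
```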
pred_df is a pandas dataframe containing a row for each unique combination of base editing outcomes. The column 'Predicted frequency' sums to one.
stats is a dict with the following keys.
- Total predicted probability
- 50-nt target sequence
- Assumed protospacer sequence
- Base editor
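As a rough illustration of how the stats dict might be used, the sketch below reads the total predicted probability and flags results that fall under a threshold. It assumes the dict keys are spelled exactly as in the list above. `covered_probability`, the 0.9 threshold, and the hand-written `example_stats` values are all placeholders for illustration, not output from the model.

```python
def covered_probability(stats, minimum=0.9):
    """Return (total, ok) where ok is True if total >= minimum.

    Assumes the key is spelled exactly as in the README list.
    The default threshold is an arbitrary illustrative choice.
    """
    total = float(stats['Total predicted probability'])
    return total, total >= minimum


# Hand-written stand-in for the dict returned by predict(); the numbers
# here are placeholders, not real model output.
example_stats = {
    'Total predicted probability': 0.95,
    'Base editor': 'BE4',
}

total, ok = covered_probability(example_stats)
if not ok:
    print(f"Only {total:.2f} of the probability mass was covered")
```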
from be_predict_bystander import predict as bystander_model
bystander_model.init_model(base_editor='BE4', celltype='mES')

seq = 'TATCAGCGGGAATTCAAGCGCACCAGCCAGAGGTGTACCGTGGACGTGAG'
pred_df, stats = bystander_model.predict(seq)
Additional methods and advanced topics
Once you have obtained pred_df and stats, additional methods are available for your convenience.
Obtaining exact genotypes
pred_df, stats = bystander_model.predict(seq)
pred_df = bystander_model.add_genotype_column(pred_df, stats)
A new column, Genotype, will be created.
Increasing total predicted probability
This tool outputs predictions on the combinatorial space of size 4^N where N is the number of substrate nucleotides (A or C for ABEs, and C or G for CBEs) in the editing windows defined in editor_profiles.csv. To maximize utility, we use a heuristic search designed to cover the vast majority of total probability while querying a small fraction of all possible combinations of edits. We anticipate that our heuristic strategy will be sufficient for most users. However, if you'd like to change this behavior, you can edit the code in predict.py -- the private function __seq_to_query_df is a good place to start.
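To give a sense of how quickly the 4^N space grows, the sketch below counts substrate nucleotides in a window and computes the size of the full outcome space. The window of positions 1 to 10 is a hypothetical placeholder; the real windows come from editor_profiles.csv and differ between editors. `outcome_space_size` and the `SUBSTRATES` table are illustrative names and are not part of predict.py.

```python
# Substrate nucleotides as described above: A or C for ABEs, C or G for CBEs.
SUBSTRATES = {'ABE': 'AC', 'CBE': 'CG'}


def outcome_space_size(seq, editor_class, window=(1, 10)):
    """Return (N, 4**N) for the substrate nucleotides in the window.

    `window` is an inclusive (start, end) pair in the -19..30 position
    convention. The default is a made-up example, not a value taken from
    editor_profiles.csv.
    """
    bases = SUBSTRATES[editor_class]
    start, end = window
    # Position p corresponds to string index p + 19.
    window_seq = seq[start + 19:end + 19 + 1].upper()
    n = sum(base in bases for base in window_seq)
    return n, 4 ** n


seq = 'TATCAGCGGGAATTCAAGCGCACCAGCCAGAGGTGTACCGTGGACGTGAG'
n, size = outcome_space_size(seq, 'CBE')
print(n, size)  # 7 16384
```

Even this short hypothetical window yields thousands of possible combinations, which is why the tool relies on a heuristic search rather than enumerating all of them.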
maxwshen at mit.edu