In [1]:
import pandas as pd
import gzip
import numpy as np

This notebook identifies the percentage of the DeWeirdt 2022 tiling screen (nonessential *and* essential targeting guides) that is excluded by the influential off-target site (OTS) filter implemented by Guidescan2, compared with the filter implemented in CRISPick for the design of Jacquere.


**Guidescan2 influential off-target site filter** 

Due to excessive risk of off-target activity, Guidescan2 does not assign a specificity score (thus imposing absolute exclusion) to guides with:

- any OTS with 0 mismatches to target site

- more than 1 OTS with 1 mismatch to target site 

and includes no OTS with an NGG PAM in this search. The mismatch count includes any mismatch across the entire 20mer.
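As a minimal sketch, the two exclusion criteria listed above can be written as a single predicate. The function name and its arguments are hypothetical and are not part of Guidescan2; the per-guide OTS counts are assumed to have been computed already.

```python
def guidescan2_excludes(n_ots_0mm: int, n_ots_1mm: int) -> bool:
    """Return True if a guide meets either exclusion criterion described above.

    n_ots_0mm: number of OTS with 0 mismatches to the target site (assumed input)
    n_ots_1mm: number of OTS with exactly 1 mismatch to the target site (assumed input)
    """
    return n_ots_0mm > 0 or n_ots_1mm > 1


# Illustrative counts only; these are not real guides.
for counts in [(0, 0), (0, 1), (0, 2), (1, 0)]:
    print(counts, guidescan2_excludes(*counts))
```

A guide with a single 1-mismatch OTS is retained under this rule; a second 1-mismatch OTS, or any perfect-match OTS, triggers exclusion.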



The DeWeirdt 2022 tiling library (essentials and nonessentials) was supplied as input to the [Guidescan2 gRNA design tool](https://guidescan.com/) to acquire specificity scores.

In [2]:
# Guidescan2 output for the tiling library, split across three batches
guidescan_batch1 = pd.read_csv("../Data/rs3val_ess_noness_guidescan_batch1.csv")
guidescan_batch2 = pd.read_csv("../Data/rs3val_ess_noness_guidescan_batch2.csv")
guidescan_batch3 = pd.read_csv("../Data/rs3val_ess_noness_guidescan_batch3.csv")

# stack all three batches into one table
guidescan = pd.concat([guidescan_batch1, guidescan_batch2, guidescan_batch3]).reset_index(drop=True)


Guides that have either 1) any OTS with 0 mismatches to the target site or 2) more than 1 OTS with 1 mismatch to the target site yield an error message in the "Chromosome" column of the Guidescan2 output. Thus, the % of guides with no target chromosome provided approximates the % of guides that are eliminated from library design by Guidescan2 on the basis of these two criteria.
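The counting logic described above can be illustrated on a toy table. This is only a sketch: the guide names and the error text are hypothetical placeholders, not actual Guidescan2 output, and the only assumption carried over from the notebook is that a valid row contains "chr" in the "Chromosome" column.

```python
import pandas as pd

# Toy stand-in for the Guidescan2 output; all values are illustrative.
toy = pd.DataFrame({
    "gRNA": ["g1", "g2", "g3", "g4"],
    "Chromosome": ["chr1", "chrX", "chr7", "placeholder error text"],
})

# A row is flagged when its "Chromosome" entry contains no "chr" substring.
flagged = toy["Chromosome"].str.count("chr") == 0

# The mean of a boolean Series is the fraction of True values.
pct_flagged = round(100 * flagged.mean(), 2)
print(pct_flagged)
```

Taking the mean of the boolean mask gives the same percentage as dividing the filtered length by the total length.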

In [3]:
print("% of DeWeirdt 2022 tiling library exceeding Guidescan2 influential off-target site filter:", round(100*len(guidescan[guidescan["Chromosome"].str.count("chr")==0])/len(guidescan),2))


% of DeWeirdt 2022 tiling library exceeding Guidescan2 influential off-target site filter: 30.08


**CRISPick/ Jacquere influential off-target site filter**

We preferentially select guides that have no OTS with CFD = 1.0 and fewer than 2 mismatches (MM). When needed, this restriction is relaxed so that it applies only to such OTS in Tier I (protein-coding) regions.

These criteria can be violated if necessary to meet the quota for a gene, in which case the Aggregate CFD score serves to exclude guides with excessive off-target activity.
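One way to picture this tiered preference is the sketch below. It is an illustration under stated assumptions, not CRISPick's actual selection code: the column names, the toy counts, and the `quota` value are hypothetical, and any ranking within a tier (for example by on-target score or Aggregate CFD) is deliberately omitted.

```python
import numpy as np
import pandas as pd

# Toy per-guide counts of OTS with CFD = 1.0 and fewer than 2 mismatches (illustrative only).
guides = pd.DataFrame({
    "guide": ["gA", "gB", "gC", "gD"],
    "n_cfd1_lt2mm": [0, 2, 1, 0],        # such OTS in any tier
    "n_cfd1_lt2mm_tierI": [0, 0, 1, 0],  # such OTS in Tier I (protein-coding) regions
})

# 0 = passes the strict filter, 1 = passes only the relaxed filter, 2 = violates both.
guides["filter_tier"] = np.select(
    [guides["n_cfd1_lt2mm"] == 0, guides["n_cfd1_lt2mm_tierI"] == 0],
    [0, 1],
    default=2,
)

# Fill a hypothetical per-gene quota, preferring lower tiers; a stable sort keeps input order within a tier.
quota = 3
picked = guides.sort_values("filter_tier", kind="stable").head(quota)
print(picked[["guide", "filter_tier"]])
```

With a quota of 3, the two guides passing the strict filter are taken first and the remaining slot goes to the guide that passes only the relaxed filter.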

To identify the number of OTS this excludes, we ran the DeWeirdt 2022 Validation Tiling Dataset (including essentials and nonessentials) through a developmental version of CRISPick that details each individual off-target site of each guide.

In [4]:
#essentials verbose batch 1
with gzip.open('../Data/rs3validation_essentials_batch1_verbose_1_22_2024-sgrna-designs.offtargetdisco.txt.gz') as f:
    rs3val_ess_verbose_batch1 = pd.read_table(f,index_col=False,header=None,names=["Target","Target Sequence","Context Sequence","Match Tier", "CFD Score","PAM Sequence", "# of mismatches"])
#essentials verbose batch 2
with gzip.open('../Data/rs3val_essentials_batch2_verbose_1_22_2024-sgrna-designs.offtargetdisco.txt.gz') as f:
    rs3val_ess_verbose_batch2 = pd.read_table(f,index_col=False,header=None,names=["Target","Target Sequence","Context Sequence","Match Tier", "CFD Score","PAM Sequence", "# of mismatches"])
#nonessentials verbose
with gzip.open('../Data/sgrna-designs.targetdisco.txt.gz') as f:
    rs3val_noness_verbose = pd.read_table(f,index_col=False,header=None,names=["Target","Target Sequence","Context Sequence","Match Tier", "CFD Score","PAM Sequence", "# of mismatches"])

#join both essential batches
rs3val_ess_verbose=pd.concat([rs3val_ess_verbose_batch1,rs3val_ess_verbose_batch2]).reset_index(drop=True)    
#join essentials and nonessentials
rs3val_verbose=pd.concat([rs3val_ess_verbose,rs3val_noness_verbose])



In [5]:
#drop rows whose mismatch count is not numeric (data from sgRNAs with >10,000 OTS)
rs3val_verbose=rs3val_verbose[pd.to_numeric(rs3val_verbose["# of mismatches"], errors='coerce').notnull()].reset_index(drop=True).copy()
rs3val_verbose["# of mismatches"]=rs3val_verbose["# of mismatches"].astype(np.int64)
rs3val_verbose["CFD Score"]=rs3val_verbose["CFD Score"].astype(np.float64)

#flag each off-target site: CFD of 1.0 in any tier
rs3val_verbose["cfd_1.0"]= np.where(rs3val_verbose["CFD Score"]==1, 1,0)
#strict criterion: CFD of 1.0 with fewer than 2 mismatches
rs3val_verbose["cfd_1.0_below2mm"]= np.where((rs3val_verbose["# of mismatches"]<2)&(rs3val_verbose["CFD Score"]==1), 1,0)
#relaxed criterion: as above, restricted to Tier I (protein-coding) sites
rs3val_verbose["cfd_1.0_below2mm_proteincoding"]= np.where((rs3val_verbose["# of mismatches"]<2)&(rs3val_verbose["Match Tier"]=="Tier I")&(rs3val_verbose["CFD Score"]==1), 1,0)
rs3val_verbose

Unnamed: 0,Target,Target Sequence,Context Sequence,Match Tier,CFD Score,PAM Sequence,# of mismatches,cfd_1.0,cfd_1.0_below2mm,cfd_1.0_below2mm_proteincoding
0,ENSG00000127837 (AAMP),AAGCTTAGGGTCTCCAGTGG,ATGGAAGCTTAGGGTCTCCAGTGGGGGGGT,Tier II,0.277333,NGG,3,0,0,0
1,ENSG00000127837 (AAMP),AAGCTTAGGGTCTCCAGTGG,ATGGAAGCTTAGGGTCTCCAGTGGGGGGGT,Tier IV,0.727273,NGG,3,0,0,0
2,ENSG00000127837 (AAMP),AAGCTTAGGGTCTCCAGTGG,ATGGAAGCTTAGGGTCTCCAGTGGGGGGGT,Tier II,0.350000,NGG,3,0,0,0
3,ENSG00000127837 (AAMP),AAGCTTAGGGTCTCCAGTGG,ATGGAAGCTTAGGGTCTCCAGTGGGGGGGT,Tier IV,0.540388,NGG,3,0,0,0
4,ENSG00000127837 (AAMP),AAGCTTAGGGTCTCCAGTGG,ATGGAAGCTTAGGGTCTCCAGTGGGGGGGT,Tier IV,0.397727,NGG,3,0,0,0
...,...,...,...,...,...,...,...,...,...,...
192826469,54714 (CNGB3),TGATTGACGAGAAGTCCCTC,TGAGTGATTGACGAGAAGTCCCTCTGGGTA,Tier IV,0.482143,NGG,2,0,0,0
192826470,54714 (CNGB3),TGATTGACGAGAAGTCCCTC,TGAGTGATTGACGAGAAGTCCCTCTGGGTA,Tier IV,0.143590,NGG,2,0,0,0
192826471,54714 (CNGB3),TGATTGACGAGAAGTCCCTC,TGAGTGATTGACGAGAAGTCCCTCTGGGTA,Tier II,0.197802,NGG,2,0,0,0
192826472,54714 (CNGB3),TGATTGACGAGAAGTCCCTC,TGAGTGATTGACGAGAAGTCCCTCTGGGTA,Tier IV,0.301961,NGG,2,0,0,0


In [6]:
#aggregate by guide
rs3val_agg = (rs3val_verbose.groupby(["Target","Target Sequence"])
              .agg(num_cfd100_sites=("cfd_1.0_below2mm","sum"),
                   num_cfd100_tierI_sites=("cfd_1.0_below2mm_proteincoding","sum"))
              .reset_index())
rs3val_agg

Unnamed: 0,Target,Target Sequence,num_cfd100_sites,num_cfd100_tierI_sites
0,100132476 (KRTAP4-7),AAGAGGAGGCACAGCACAAG,4,4
1,100132476 (KRTAP4-7),AAGTGGTGTGGCAGGAGACT,6,6
2,100132476 (KRTAP4-7),ACAACAGCTGGGGCGGTAGC,1,1
3,100132476 (KRTAP4-7),ACAGCACAAGGGGCGGGGAC,0,0
4,100132476 (KRTAP4-7),ACAGCAGCTGGGGCGGCAGC,21,21
...,...,...,...,...
191958,ENSG00000284770 (TBCE),TTCTTTGTCCTCTTTGGTCA,1,1
191959,ENSG00000284770 (TBCE),TTGCTTTCTTAAACAGAATA,2,1
191960,ENSG00000284770 (TBCE),TTTACCAGGCACCCGACAGG,1,1
191961,ENSG00000284770 (TBCE),TTTGCTGGGCAACTTACTTT,2,1


Determining the percentage of the DeWeirdt 2022 tiling library that violates the strict and the relaxed influential OTS criteria in CRISPick/Jacquere. Once again, unlike in Guidescan2, such guides are not subject to absolute exclusion but are instead avoided when possible.

In [10]:
print("% of guides in entire tiling set that would be excluded by strict filter:",100*len(rs3val_agg[rs3val_agg["num_cfd100_sites"]>0])/len(rs3val_agg))
print("% excluded by relaxed filter:",100*len(rs3val_agg[rs3val_agg["num_cfd100_tierI_sites"]>0])/len(rs3val_agg))

% of guides in entire tiling set that would be excluded by strict filter: 32.13015008100519
% excluded by relaxed filter: 10.020681068747622
