The purpose of this notebook is to convert the raw read counts in Supplementary Data 1 into log2-fold changes (LFCs) for each variant screened with the PAM-mapping (PanPAM) library.

In [1]:
import pandas as pd
from poola import core as pool
from scipy.stats import pearsonr

Read in Supplementary Data 1

One tab contains the reads, another the library annotation

In [2]:
panpam_reads = pd.read_excel('../../Supplementary Data/Supplementary_Data1_v3.xlsx', 'PanPAM counts', 
                             skiprows=2, names = ['sgRNA Sequence', 'pDNA', 'WTCas9_RepA', 'WTCas9_RepB',
                                                 'Cas9-HF1_RepA', 'Cas9-HF1_RepB', 'eCas9-1.1_RepA', 'eCas9-1.1_RepB',
                                                 'evoCas9_RepA', 'evoCas9_RepB', 'HypaCas9_RepA', 'HypaCas9_RepB',
                                                 'xCas9-3.7_RepA', 'xCas9-3.7_RepB', 'Cas9-VQR_RepA', 'Cas9-VQR_RepB',
                                                 'Cas9-VRER_RepA', 'Cas9-VRER_RepB', 'Cas9-NG_RepA', 'Cas9-NG_RepB',
                                                 'SpG_RepA', 'SpG_RepB'])
annotation = pd.read_excel('../../Supplementary Data/Supplementary_Data1_v3.xlsx', 'Library annotation')


In [7]:
print(panpam_reads.shape)
panpam_reads.head()

(18768, 22)


Unnamed: 0,sgRNA Sequence,pDNA,WTCas9_RepA,WTCas9_RepB,Cas9-HF1_RepA,Cas9-HF1_RepB,eCas9-1.1_RepA,eCas9-1.1_RepB,evoCas9_RepA,evoCas9_RepB,...,xCas9-3.7_RepA,xCas9-3.7_RepB,Cas9-VQR_RepA,Cas9-VQR_RepB,Cas9-VRER_RepA,Cas9-VRER_RepB,Cas9-NG_RepA,Cas9-NG_RepB,SpG_RepA,SpG_RepB
0,AAAAAAAGAATCCTTACCGC,2110,608,408,1327,983,345,227,77,309,...,891,796,391,747,424,440,1241,1187,1441,720
1,AAAAAAATACCGAAAGACCA,811,217,86,188,251,110,237,40,161,...,336,264,306,528,194,175,920,663,613,386
2,AAAAAACGCTTACTTGGGAT,1349,323,367,485,650,224,326,49,132,...,282,346,278,597,724,179,1599,1034,1145,537
3,AAAAAACGGCTCTCTCAACG,2460,811,634,992,996,321,539,88,395,...,996,913,639,1053,438,384,1556,1749,1921,793
4,AAAAAAGAATCCTTACCGCT,848,246,194,164,363,76,219,52,65,...,227,510,159,261,186,145,613,1020,496,302


Merge reads with annotation file

In [8]:
panpam_annotated = pd.merge(panpam_reads, annotation, left_on='sgRNA Sequence', right_on='Construct Barcode')
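
`pd.merge` performs an inner join by default, so any sgRNA without a matching `Construct Barcode` would be dropped silently. In this notebook the row count is the same before and after the merge, which is consistent with every sgRNA being matched. As a minimal sketch of how such a check could be made explicit, the example below uses a left join with `indicator=True`. The frames `toy_reads` and `toy_annotation` and their values are hypothetical stand-ins, not data from the supplementary file.

```python
import pandas as pd

# Hypothetical toy stand-ins for the reads and annotation tables
toy_reads = pd.DataFrame({'sgRNA Sequence': ['AAAC', 'AAAG', 'AAAT'],
                          'pDNA': [100, 200, 300]})
toy_annotation = pd.DataFrame({'Construct Barcode': ['AAAC', 'AAAG'],
                               'PAM': ['TGTA', 'AAAA']})

# A left join keeps every reads row; the '_merge' column added by
# indicator=True marks rows that found no annotation as 'left_only'
checked = pd.merge(toy_reads, toy_annotation, how='left',
                   left_on='sgRNA Sequence', right_on='Construct Barcode',
                   indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'sgRNA Sequence'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real tables would confirm that the inner join above loses no sgRNAs.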

In [9]:
print(panpam_annotated.shape)
panpam_annotated.head()

(18768, 30)


Unnamed: 0,sgRNA Sequence,pDNA,WTCas9_RepA,WTCas9_RepB,Cas9-HF1_RepA,Cas9-HF1_RepB,eCas9-1.1_RepA,eCas9-1.1_RepB,evoCas9_RepA,evoCas9_RepB,...,SpG_RepA,SpG_RepB,Construct Barcode,Construct IDs,PAM,Gene,G19,g20,G20,Context Sequence
0,AAAAAAAGAATCCTTACCGC,2110,608,408,1327,983,345,227,77,309,...,1441,720,AAAAAAAGAATCCTTACCGC,Essential,TGTA,GEMIN5,n,y,n,GCTCAAAAAAAGAATCCTTACCGCTGTAACC
1,AAAAAAATACCGAAAGACCA,811,217,86,188,251,110,237,40,161,...,613,386,AAAAAAATACCGAAAGACCA,BRCA2,AAAA,BRCA2,n,y,n,AGGAAAAAAAATACCGAAAGACCAAAAATCA
2,AAAAAACGCTTACTTGGGAT,1349,323,367,485,650,224,326,49,132,...,1145,537,AAAAAACGCTTACTTGGGAT,Essential,CAAC,SRSF11,n,n,y,AAAGAAAAAACGCTTACTTGGGATCAACAGT
3,AAAAAACGGCTCTCTCAACG,2460,811,634,992,996,321,539,88,395,...,1921,793,AAAAAACGGCTCTCTCAACG,Essential,CTAA,GEMIN5,y,y,n,TGGAGAAAAAACGGCTCTCTCAACCTAAGGC
4,AAAAAAGAATCCTTACCGCT,848,246,194,164,363,76,219,52,65,...,496,302,AAAAAAGAATCCTTACCGCT,Essential,GTAA,GEMIN5,n,y,n,CTCAAAAAAAGAATCCTTACCGCTGTAACCT


Select the read-count columns (every column except the sgRNA sequence) for use in calculating lognorms

In [10]:
col = panpam_reads.columns
cols = col[1:]
cols

Index(['pDNA', 'WTCas9_RepA', 'WTCas9_RepB', 'Cas9-HF1_RepA', 'Cas9-HF1_RepB',
       'eCas9-1.1_RepA', 'eCas9-1.1_RepB', 'evoCas9_RepA', 'evoCas9_RepB',
       'HypaCas9_RepA', 'HypaCas9_RepB', 'xCas9-3.7_RepA', 'xCas9-3.7_RepB',
       'Cas9-VQR_RepA', 'Cas9-VQR_RepB', 'Cas9-VRER_RepA', 'Cas9-VRER_RepB',
       'Cas9-NG_RepA', 'Cas9-NG_RepB', 'SpG_RepA', 'SpG_RepB'],
      dtype='object')

Calculate log-normalized reads (lognorms) from the raw reads, then filter out sgRNAs with low pDNA abundance (z_low=-3)

In [13]:
lognorms = pool.lognorm_columns(reads_df=panpam_annotated, columns=cols)
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' rows due to low pDNA abundance')

Filtered 117 rows due to low pDNA abundance
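
For readers without `poola` installed, the two steps above can be approximated in plain pandas. This is only a sketch under stated assumptions: that the lognorm is log2 of reads per million plus a pseudocount of 1, and that the filter drops sgRNAs whose pDNA lognorm has a z-score at or below the cutoff. `poola`'s exact formulas may differ, and `lognorm_sketch` and the toy counts are hypothetical.

```python
import numpy as np
import pandas as pd

def lognorm_sketch(reads):
    """Assumed formula: log2(reads per million + 1) for one column of counts."""
    return np.log2(reads / reads.sum() * 1e6 + 1)

# Hypothetical toy counts: 19 well-represented sgRNAs and one nearly absent from the pDNA
toy_reads = pd.DataFrame({'pDNA': [500] * 19 + [1],
                          'WTCas9_RepA': [250] * 19 + [3]})
toy_lognorms = toy_reads.apply(lognorm_sketch)  # normalize each column separately

# z-score the pDNA lognorms and keep sgRNAs above the cutoff (assumed filter logic)
pdna = toy_lognorms['pDNA']
pdna_z = (pdna - pdna.mean()) / pdna.std()
toy_filtered = toy_lognorms[pdna_z > -3]

n_removed = toy_lognorms.shape[0] - toy_filtered.shape[0]
print(n_removed)
```

Filtering on the pDNA column alone means an sgRNA is removed because it was poorly represented in the starting library, not because it dropped out during the screen.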


In [15]:
lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=cols)
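
Conceptually, a log2-fold change from pDNA is the difference between a sample's lognorm and the pDNA lognorm for the same sgRNA. The sketch below illustrates that subtraction on hypothetical toy values. It is an illustration of the arithmetic, not `poola`'s implementation.

```python
import pandas as pd

# Hypothetical lognorm values for three sgRNAs
toy_lognorms = pd.DataFrame({'pDNA': [10.0, 9.5, 11.0],
                             'WTCas9_RepA': [8.0, 9.5, 7.0],
                             'WTCas9_RepB': [8.5, 9.0, 7.5]})
target_cols = ['WTCas9_RepA', 'WTCas9_RepB']

# Subtract the pDNA lognorm from each target column, row by row
toy_lfcs = toy_lognorms[target_cols].sub(toy_lognorms['pDNA'], axis=0)
print(toy_lfcs)
```

A negative LFC means the sgRNA is depleted relative to the plasmid library, and a value near zero means its abundance is unchanged.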


Calculate average LFC across replicates. 

In [17]:
variants = ['WTCas9', 'Cas9-HF1', 'eCas9-1.1', 'evoCas9', 'HypaCas9',
            'xCas9-3.7', 'Cas9-VQR', 'Cas9-VRER', 'Cas9-NG', 'SpG']
for variant in variants:
    # Average the LFCs of the two replicates for this variant
    rep_cols = [variant + '_RepA', variant + '_RepB']
    lfc_df[variant + '_AVGLFC_frompDNA'] = lfc_df[rep_cols].mean(axis=1)
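
Averaging replicates is most meaningful when the replicates agree. `pearsonr` is imported at the top of this notebook but is not used in the cells shown, so the following is only a sketch of one plausible use: checking the replicate correlation for a single variant. The LFC values are hypothetical toy numbers.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical replicate LFCs for four sgRNAs
toy_lfcs = pd.DataFrame({'WTCas9_RepA': [-2.0, 0.0, -4.0, -1.0],
                         'WTCas9_RepB': [-1.5, -0.5, -3.5, -1.0]})

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(toy_lfcs['WTCas9_RepA'], toy_lfcs['WTCas9_RepB'])
print(round(r, 3))
```

On the real data, the same call could be repeated for each pair of `_RepA` and `_RepB` columns before the averages are used downstream.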


Save the LFC table, which includes both the single-replicate and the replicate-averaged LFCs for each variant screened with the PanPAM library. This file will serve as the input for the "Activity Metrics" notebook.

In [18]:
lfc_df.to_csv('../../data_v3/Fig 1_3_PanPAM on-target/processed/panpam_avglfc_v2.txt', sep='\t')
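
Because `to_csv` is called without `index=False`, the row index is written as the first, unnamed column of the file. A downstream notebook can restore it with `index_col=0`. The sketch below shows the round trip on a hypothetical toy frame, using an in-memory buffer in place of the real file path.

```python
import io
import pandas as pd

# Hypothetical toy frame with a non-contiguous index, as could be left after row filtering
toy_lfcs = pd.DataFrame({'sgRNA Sequence': ['AAAC', 'AAAG'],
                         'WTCas9_AVGLFC_frompDNA': [-1.75, -0.25]},
                        index=[0, 2])

# Write tab-separated text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_lfcs.to_csv(buffer, sep='\t')
buffer.seek(0)

# index_col=0 turns the first column back into the row index
reloaded = pd.read_csv(buffer, sep='\t', index_col=0)
print(reloaded)
```

Reading the file without `index_col=0` would instead produce an extra column named `Unnamed: 0`.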