# Data Processing

The purpose of this notebook is to process the raw reads from any of the Supplementary Data files into log2-fold changes (LFCs). As an example, Supplementary Data 1 is used here to calculate LFCs for each Cas9 variant screened with the PAM-mapping (PanPAM) library.

In [1]:
import pandas as pd
from poola import core as pool
from scipy.stats import pearsonr

Read in Supplementary Data 1

One sheet contains the read counts; another contains the library annotation.

In [4]:
# Path to the workbook that holds both the read counts and the library annotation
supp_data1 = '../data_files/Supplementary Data/Supplementary_Data1_v3.xlsx'
# skiprows=2 skips the first two rows of the sheet; names replaces the header row
panpam_reads = pd.read_excel(supp_data1, sheet_name='PanPAM counts', skiprows=2,
                             names=['sgRNA Sequence', 'pDNA', 'WTCas9_RepA', 'WTCas9_RepB',
                                    'Cas9-HF1_RepA', 'Cas9-HF1_RepB', 'eCas9-1.1_RepA', 'eCas9-1.1_RepB',
                                    'evoCas9_RepA', 'evoCas9_RepB', 'HypaCas9_RepA', 'HypaCas9_RepB',
                                    'xCas9-3.7_RepA', 'xCas9-3.7_RepB', 'Cas9-VQR_RepA', 'Cas9-VQR_RepB',
                                    'Cas9-VRER_RepA', 'Cas9-VRER_RepB', 'Cas9-NG_RepA', 'Cas9-NG_RepB',
                                    'SpG_RepA', 'SpG_RepB'])
annotation = pd.read_excel(supp_data1, sheet_name='Library annotation')


In [None]:
print(panpam_reads.shape)
panpam_reads.head()

Merge reads with annotation file

In [None]:
panpam_annotated = pd.merge(panpam_reads, annotation, left_on='sgRNA Sequence', right_on='Construct Barcode')

In [None]:
print(panpam_annotated.shape)
panpam_annotated.head()
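
An inner merge silently drops any sgRNA that has no match in the annotation, so it can help to check the match before continuing. The sketch below is not part of the original analysis: it uses toy data (the sequences and the `PAM` column are invented for illustration) together with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the reads and annotation tables (all values are invented).
reads = pd.DataFrame({
    'sgRNA Sequence': ['AAAA', 'CCCC', 'GGGG'],
    'pDNA': [100, 250, 80],
})
annot = pd.DataFrame({
    'Construct Barcode': ['AAAA', 'CCCC', 'TTTT'],
    'PAM': ['AGG', 'TGG', 'CGG'],  # hypothetical annotation column
})

# An outer merge with an indicator column records where each row came from;
# validate='one_to_one' raises if a key is duplicated on either side.
check = pd.merge(reads, annot, left_on='sgRNA Sequence',
                 right_on='Construct Barcode', how='outer',
                 indicator=True, validate='one_to_one')

n_both = (check['_merge'] == 'both').sum()
unmatched_reads = check.loc[check['_merge'] == 'left_only', 'sgRNA Sequence'].tolist()
print(n_both, unmatched_reads)
```

If `unmatched_reads` is non-empty on the real data, those sgRNAs would be absent from `panpam_annotated`.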

Select the count columns (every column except 'sgRNA Sequence') for use in calculating lognorms

In [None]:
col = panpam_reads.columns
cols = col[1:]
cols

Calculate lognorm from reads, and filter based on pDNA abundance

In [None]:
lognorms = pool.lognorm_columns(reads_df=panpam_annotated, columns=cols)
filtered_lognorms = pool.filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'], z_low=-3)
print('Filtered ' + str(lognorms.shape[0] - filtered_lognorms.shape[0]) + ' sgRNAs due to low pDNA abundance')
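
For intuition, the two steps above can be approximated in plain pandas. This is a sketch on toy counts, assuming that a lognorm is log2(reads per million + 1) and that the pDNA filter drops rows whose pDNA lognorm z-score falls below `z_low`. poola's actual implementation may differ in detail, and the helper names here are hypothetical.

```python
import numpy as np
import pandas as pd

def lognorm_sketch(counts):
    """log2 of reads per million plus a pseudocount of 1 (assumed definition)."""
    return np.log2(counts / counts.sum() * 1e6 + 1)

def filter_low_pdna_sketch(df, pdna_col='pDNA', z_low=-3):
    """Keep rows whose pDNA z-score is above z_low (assumed filtering rule)."""
    z = (df[pdna_col] - df[pdna_col].mean()) / df[pdna_col].std()
    return df[z > z_low]

# Toy counts: 19 well-represented guides plus one that is nearly absent from the pDNA.
toy = pd.DataFrame({
    'pDNA': [1000] * 19 + [1],
    'WTCas9_RepA': [900] * 19 + [5],
})
toy_lognorms = toy.apply(lognorm_sketch)  # normalize each column separately
toy_filtered = filter_low_pdna_sketch(toy_lognorms, z_low=-3)
n_removed = toy_lognorms.shape[0] - toy_filtered.shape[0]
print(n_removed)
```

Note that the difference in `shape[0]` counts rows, that is, sgRNAs removed, not columns.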

In [None]:
# pDNA is the reference, so leave it out of the target columns
lfc_df = pool.calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=cols.drop('pDNA'))
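
Because lognorms are already on a log2 scale, a log2-fold change relative to the pDNA amounts to a subtraction. The following is a minimal sketch on invented lognorm values, not poola's own code; `toy_lognorms` and `toy_lfcs` are illustrative names.

```python
import pandas as pd

# Toy lognorm values (invented) for a reference column and two replicates.
toy_lognorms = pd.DataFrame({
    'pDNA': [10.0, 10.0, 8.0],
    'WTCas9_RepA': [9.0, 6.0, 8.5],
    'WTCas9_RepB': [9.5, 5.0, 8.0],
})

ref_col = 'pDNA'
# Every column other than the reference is treated as a target.
target_cols = [c for c in toy_lognorms.columns if c != ref_col]

# Subtract the reference from each target column, row by row.
toy_lfcs = toy_lognorms[target_cols].sub(toy_lognorms[ref_col], axis=0)
print(toy_lfcs)
```

Negative values indicate sgRNAs that were depleted relative to the pDNA.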


Calculate average LFC across replicates. 

In [None]:
# Average the two replicate LFC columns for each Cas9 variant
variants = ['WTCas9', 'Cas9-HF1', 'eCas9-1.1', 'evoCas9', 'HypaCas9',
            'xCas9-3.7', 'Cas9-VQR', 'Cas9-VRER', 'Cas9-NG', 'SpG']
for variant in variants:
    rep_cols = [variant + '_RepA', variant + '_RepB']
    lfc_df[variant + '_AVGLFC_frompDNA'] = lfc_df[rep_cols].mean(axis=1)
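
Averaging is most meaningful when the replicates agree. As a hedged sketch (toy LFC values; this check is not a step in the original notebook), the Pearson correlation between replicates can be inspected with pandas before averaging. `scipy.stats.pearsonr`, imported at the top of this notebook, would additionally return a p-value.

```python
import pandas as pd

# Toy replicate LFCs (invented values).
toy_lfcs = pd.DataFrame({
    'WTCas9_RepA': [-1.0, -4.0, 0.5, -2.0],
    'WTCas9_RepB': [-0.5, -5.0, 0.0, -2.5],
})

# Series.corr uses the Pearson method by default.
rep_r = toy_lfcs['WTCas9_RepA'].corr(toy_lfcs['WTCas9_RepB'])

# Row-wise mean of the two replicates, named as in the notebook.
toy_lfcs['WTCas9_AVGLFC_frompDNA'] = toy_lfcs[['WTCas9_RepA', 'WTCas9_RepB']].mean(axis=1)
print(round(rep_r, 3))
```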


Save this table, which includes both the single-replicate and the average LFCs, for each cell line screened with the PanPAM library. It serves as the input for the "Activity Metrics" notebook.

In [18]:
lfc_df.to_csv('../../data_v3/Fig 1_3_PanPAM on-target/processed/panpam_avglfc_v2.txt', sep='\t')
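
Because `to_csv` writes the DataFrame index as the first column by default, a downstream notebook would need `index_col=0` to recover the same table. Below is a small round-trip sketch on toy data, using an in-memory buffer in place of the real output path.

```python
import io
import pandas as pd

# Toy table (invented values) standing in for lfc_df.
toy = pd.DataFrame({
    'sgRNA Sequence': ['AAAA', 'CCCC'],
    'WTCas9_AVGLFC_frompDNA': [-0.75, -4.5],
})

# Write tab-separated text to a buffer, index included, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t')
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an 'Unnamed: 0' column.
reloaded = pd.read_csv(buffer, sep='\t', index_col=0)
print(reloaded.columns.tolist())
```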