## Basic usage

In [1]:
import gnt
import pandas as pd

### Data
First, we load data from Najm et al., "Orthologous CRISPR–Cas9 enzymes for combinatorial genetic screens." The input should have the following columns, in order: guide 1, guide 2, gene 1, gene 2, followed by one column of log-fold changes (LFCs) for each condition that was screened.

In [2]:
lfcs = pd.read_csv('https://raw.githubusercontent.com/PeterDeWeirdt/bigpapi/master/data/processed/bigpapi_lfcs.csv')
lfcs

Unnamed: 0,U6 Sequence,H1 Sequence,U6 gene,H1 gene,Day 21_786O,Day 21_A375,Day 21_A549,Day 21_HT29,Day 21_Meljuso,Day 21_OVCAR8
0,AAAGTGGAACTCAGGACATG,AAAAAAAGAGTCGAATGTTTT,HPRT intron,6T,0.421135,0.250043,0.725424,0.635972,0.127104,0.245427
1,AAAGTGGAACTCAGGACATG,AAAGAGTCCACTCTGCACTTG,HPRT intron,UBC,0.040784,0.125369,0.343278,0.524569,-0.175984,0.371689
2,AAAGTGGAACTCAGGACATG,AACAGCTCCGTGTACTGAGGC,HPRT intron,CD81,0.711486,0.857567,1.513217,0.970841,0.630948,0.675330
3,AAAGTGGAACTCAGGACATG,AAGACGAAATTGAAGACGAAG,HPRT intron,CD81,0.451992,0.588394,1.283543,0.771713,0.409791,0.643640
4,AAAGTGGAACTCAGGACATG,AAGCGTACTGCTCATCATCGT,HPRT intron,HSP90AA1,0.477678,-0.652709,0.442170,0.021827,0.187209,-0.120412
...,...,...,...,...,...,...,...,...,...,...
9179,TTCTGACTACAACATCCAGA,TTGCTTTCATTTAATGCTACA,UBB,PARP2,-0.228266,-0.371034,-0.014272,0.570256,-0.437522,-0.717872
9180,TTCTGACTACAACATCCAGA,TTGGGACGAGTCCTGTGAGAA,UBB,IMPDH1,-0.178963,-0.024237,-0.323317,0.630812,-0.421810,0.197531
9181,TTCTGACTACAACATCCAGA,TTTAGGAATTGCTGTTGGGAC,UBB,HPRT intron,-0.266031,-0.429865,-0.145153,0.147415,-0.454209,-0.266580
9182,TTCTGACTACAACATCCAGA,TTTCCATCACTTGGTTGAATA,UBB,BCL2A1,-0.295739,-0.221819,-0.173578,0.231540,-0.430689,-0.377972


### Calculating residuals
From the log-fold changes (LFCs), we calculate **residuals** at the guide level. We reason that interactors of a given “anchor” guide deviate from the expected range of LFCs of its “target” pairs. For each anchor guide, we fit a linear model between the median LFC of each target when paired with controls and the average LFC of constructs containing both the anchor and that target guide. Negative residuals from this line indicate a synthetic lethal relationship, whereas positive residuals indicate a buffering interaction.
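
To make the idea concrete, here is a minimal sketch of the residual calculation for a single anchor guide, using ordinary least squares in plain Python. It is not the `gnt` implementation: the function name `linear_residuals` and the toy LFC values are hypothetical, and the scaling step that produces `residual_z` is left out.

```python
from statistics import mean


def linear_residuals(x, y):
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, residuals), where each residual is the
    observed y minus the value predicted by the fitted line.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    return slope, intercept, residuals


# Hypothetical values for one anchor guide in one condition.
# x: median LFC of each target guide when paired with control guides
# y: LFC of the construct that pairs the anchor with that target guide
target_with_controls = [-2.0, -1.0, 0.0, 1.0]
anchor_with_target = [-3.0, -1.0, 0.0, 0.0]

slope, intercept, residuals = linear_residuals(target_with_controls, anchor_with_target)
print(slope, intercept, residuals)
```

In this toy example the first target falls below the fitted line, so its residual is negative, which is the direction associated with a synthetic lethal relationship.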

In [3]:
guide_residuals, model_info = gnt.get_guide_residuals(lfcs, ['CD81', 'HPRT intron'])
guide_residuals.sort_values('residual_z')

### Model info
We can also assess the fit of the linear model for each guide by its R<sup>2</sup>. A low R<sup>2</sup> can indicate a phenotypically dominant guide.
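
As a reminder of what this quantity measures, the sketch below computes R<sup>2</sup> for a least-squares line as one minus the ratio of residual to total sum of squares. The function name `r_squared` and the toy values are hypothetical, and `gnt` may compute or report the fit differently.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares: variation left unexplained by the line.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its mean.
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical LFCs: points scattered around a line versus exactly on a line.
noisy_fit = r_squared([-2.0, -1.0, 0.0, 1.0], [-3.0, -1.0, 0.0, 0.0])
perfect_fit = r_squared([-2.0, -1.0, 0.0, 1.0], [-4.0, -2.0, 0.0, 2.0])
print(noisy_fit, perfect_fit)
```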

In [None]:
model_info.sort_values('R2')

### Combining scores at the gene level
We then combine the guide-level residuals into a single statistic for each gene pair:

$(\bar x - \mu)/(\sigma / \sqrt{n})$

where $\bar x$ is the sample mean of the residuals for the gene pair, $\mu$ and $\sigma$ are the population mean and population standard deviation of residuals, and $n$ is the number of guide pairs.
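
A small worked example of this statistic, written in plain Python, may help. The function name `gene_pair_z` and the input values are hypothetical; in particular, which residuals make up the "population" is passed in as an argument here rather than assumed, and this is not the `gnt` implementation.

```python
from math import sqrt
from statistics import mean


def gene_pair_z(pair_residuals, population_mean, population_sd):
    """z = (x_bar - mu) / (sigma / sqrt(n)) for the residuals of one gene pair."""
    n = len(pair_residuals)
    return (mean(pair_residuals) - population_mean) / (population_sd / sqrt(n))


# Hypothetical residuals for the four guide pairs targeting one gene pair.
pair = [-2.0, -1.0, -3.0, -2.0]
z = gene_pair_z(pair, population_mean=0.0, population_sd=1.0)
print(z)
```

Here $\bar x = -2$, $\sigma / \sqrt{n} = 1 / 2$, and the statistic is $-4$. Because the denominator shrinks with $\sqrt{n}$, gene pairs supported by more guide pairs reach a larger magnitude for the same mean residual.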

In [None]:
gene_scores = gnt.get_gene_residuals(guide_residuals, 'residual_z')
gene_scores.sort_values('z_score_residual_z')

## Other models: spline, fixed slope and quadratic

Other models are available for calculating residuals at the guide level, including spline, fixed slope and quadratic. The `model` argument of `get_guide_residuals` selects the model.
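
To illustrate how an alternative model changes the residuals, the sketch below shows one plausible reading of a fixed-slope model: the slope is held at a chosen value and only the intercept is estimated. This is an assumption for illustration, not a description of how `gnt` defines its fixed-slope option; the function name `fixed_slope_residuals`, the default slope of 1, and the toy values are all hypothetical.

```python
from statistics import mean


def fixed_slope_residuals(x, y, slope=1.0):
    """Residuals from y = slope * x + intercept with the slope held fixed.

    With the slope fixed, the least-squares intercept is the mean of
    (y - slope * x). The default slope of 1.0 is an assumed value.
    """
    offsets = [yi - slope * xi for xi, yi in zip(x, y)]
    intercept = mean(offsets)
    residuals = [offset - intercept for offset in offsets]
    return intercept, residuals


# Hypothetical LFCs for one anchor guide in one condition.
target_with_controls = [-2.0, -1.0, 0.0, 1.0]
anchor_with_target = [-2.5, -1.0, 0.5, 1.0]

fs_intercept, fs_residuals = fixed_slope_residuals(target_with_controls, anchor_with_target)
print(fs_intercept, fs_residuals)
```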

In [None]:
spline_residuals, spline_model_info = gnt.get_guide_residuals(lfcs, ['CD81', 'HPRT intron'], model='spline')


In [None]:
spline_residuals.sort_values('residual_z')

In [None]:
spline_gene_scores = gnt.get_gene_residuals(spline_residuals, 'residual_z')
spline_gene_scores.sort_values('z_score_residual_z').head(50)

In [None]:
merged_predictions = spline_gene_scores.merge(gene_scores, how='inner', on=['condition', 'gene_a', 'gene_b'], suffixes=['_spline', '_linear'])

In [None]:
merged_predictions.plot.scatter('z_score_residual_z_spline', 'z_score_residual_z_linear', alpha=0.4)