<a target="_blank" href="https://colab.research.google.com/github/kircherlab/MPRAlib/examples/mpralib.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
!pip --quiet install MPRAlib

Here we load a barcode count file (an output of MPRAsnakeflow) and display the DNA and RNA counts of the first five barcodes in each replicate.

In [2]:
from mpralib.mpradata import MPRAdata
import pandas as pd
import numpy as np

# Load the data
mpradata = MPRAdata.from_file("../resources/reporter_experiment.barcode.HEPG2.fromFile.default.all.tsv.gz")

print("DNA counts")
display(pd.DataFrame(mpradata.dna_counts[:,0:5], index=mpradata.replicates, columns=mpradata.barcodes[0:5]))
print("RNA counts")
display(pd.DataFrame(mpradata.rna_counts[:,0:5], index=mpradata.replicates, columns=mpradata.barcodes[0:5]))


DNA counts


barcode,TACTCTCCGTGCCCA,GGTATAACATCTCCG,TTAGGAGTCACACGT,GAATATAACACCCGA,AAACACCGCGCTCTA
1,1,0,0,1,0
2,0,3,1,1,0
3,0,1,1,1,1


RNA counts


barcode,TACTCTCCGTGCCCA,GGTATAACATCTCCG,TTAGGAGTCACACGT,GAATATAACACCCGA,AAACACCGCGCTCTA
1,1,0,0,1,0
2,0,1,1,1,0
3,0,2,2,3,1


Next, we compute pairwise correlations across replicates at different barcode thresholds.
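As a rough illustration of what a pairwise replicate correlation is, the sketch below computes Pearson and Spearman matrices with plain NumPy on a small replicates x oligos matrix. The `activity` values and both helper functions are hypothetical and only serve the example; this is not MPRAlib's implementation, and the rank step assumes there are no ties.

```python
import numpy as np


def pearson_matrix(activity):
    """Pearson correlation between rows (replicates) of a replicates x oligos matrix."""
    return np.corrcoef(activity)


def spearman_matrix(activity):
    """Spearman correlation: Pearson correlation computed on per-replicate ranks.

    Assumes no tied values within a replicate (true for the toy data below).
    """
    ranks = np.argsort(np.argsort(activity, axis=1), axis=1)
    return np.corrcoef(ranks)


# Hypothetical activity values: 3 replicates (rows) x 4 oligos (columns)
activity = np.array([
    [0.1, 0.5, 1.2, 2.0],
    [0.2, 0.4, 1.0, 3.5],
    [0.3, 0.1, 1.5, 1.9],
])

pearson = pearson_matrix(activity)
spearman = spearman_matrix(activity)
print(pearson.round(3))
print(spearman.round(3))
```

Both results are symmetric replicates x replicates matrices with ones on the diagonal, the same layout as the tables displayed by the notebook.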

In [3]:
print("Pairwise Pearson correlation")
display(pd.DataFrame(mpradata.pearson_correlation, index=mpradata.replicates, columns=mpradata.replicates).round(3))
print("Pairwise Spearman correlation")
display(pd.DataFrame(mpradata.spearman_correlation, index=mpradata.replicates, columns=mpradata.replicates).round(3))


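# Indices 1, 2 and 5 of the flattened 3x3 matrix are the off-diagonal
# replicate pairs (1,2), (1,3) and (2,3); this assumes exactly three replicates.
print("Mean Pearson correlation")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))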
print("Mean Spearman correlation")
print(mpradata.spearman_correlation.flatten()[[1,2,5]].mean().round(3))


# Set a higher barcode threshold of 10
mpradata.barcode_threshold = 10
print(f"Mean Pearson correlation, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Mean Spearman correlation, BC threshold {mpradata.barcode_threshold}")
print(mpradata.spearman_correlation.flatten()[[1,2,5]].mean().round(3))

# And a very high barcode threshold of 100
mpradata.barcode_threshold = 100
print(f"Mean Pearson correlation, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Mean Spearman correlation, BC threshold {mpradata.barcode_threshold}")
print(mpradata.spearman_correlation.flatten()[[1,2,5]].mean().round(3))

Pairwise Pearson correlation


replicate,1,2,3
1,1.0,0.456,0.469
2,0.456,1.0,0.481
3,0.469,0.481,1.0


Pairwise Spearman correlation


replicate,1,2,3
1,1.0,0.361,0.379
2,0.361,1.0,0.356
3,0.379,0.356,1.0


Mean Pearson correlation
0.469
Mean Spearman correlation
0.365
Mean Pearson correlation, BC threshold 10
0.655
Mean Spearman correlation, BC threshold 10
0.525
Mean Pearson correlation, BC threshold 100
0.964
Mean Spearman correlation, BC threshold 100
1.0
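The hard-coded indices `[1, 2, 5]` pick the three off-diagonal pairs of a 3x3 correlation matrix. A minimal sketch of the same calculation for any number of replicates is shown below, using `np.triu_indices`. The helper name `mean_pairwise` is hypothetical, and the matrix is the rounded Pearson table printed above.

```python
import numpy as np


def mean_pairwise(corr):
    """Mean of the strictly upper-triangular entries of a square correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    rows, cols = np.triu_indices(corr.shape[0], k=1)  # k=1 skips the diagonal
    return float(corr[rows, cols].mean())


# Rounded Pearson correlations from the output above
pearson = np.array([
    [1.0, 0.456, 0.469],
    [0.456, 1.0, 0.481],
    [0.469, 0.481, 1.0],
])

print(round(mean_pairwise(pearson), 3))  # 0.469
```

For three replicates this selects the same entries as `flatten()[[1, 2, 5]]`, but it keeps working if the experiment has more replicates.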


Using such a strict barcode threshold will also decrease the number of oligos available per replicate.

In [4]:
for threshold in [1, 10, 100]:
    mpradata.barcode_threshold = threshold
    print(f"Number of oligos per individual replicate, using BC threshold {mpradata.barcode_threshold}")
    display(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))


Number of oligos per individual replicate, using BC threshold 1


array([2243, 2250, 2253])

Number of oligos per individual replicate, using BC threshold 10


array([1359, 1349, 1357])

Number of oligos per individual replicate, using BC threshold 100


array([10,  3,  3])
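To make the effect of the threshold concrete, here is a small sketch on toy data. It assumes that an oligo is kept in a replicate only when it has at least `threshold` barcodes and is otherwise set to zero; this is an assumption made for illustration, not a description of MPRAlib internals. The matrix `toy` and the helper `oligos_per_replicate` are hypothetical.

```python
import numpy as np


def oligos_per_replicate(barcodes_per_oligo, threshold):
    """Count oligos per replicate that have at least `threshold` barcodes."""
    counts = np.asarray(barcodes_per_oligo)
    # Assumed behaviour: entries below the threshold are zeroed out
    kept = np.where(counts >= threshold, counts, 0)
    return np.sum(kept != 0, axis=1)


# Hypothetical barcodes-per-oligo matrix: 3 replicates (rows) x 5 oligos (columns)
toy = np.array([
    [1, 12, 150, 0, 9],
    [3, 10, 99, 25, 0],
    [0, 11, 100, 7, 2],
])

for threshold in [1, 10, 100]:
    print(threshold, oligos_per_replicate(toy, threshold))
# 1 [4 4 4]
# 10 [2 3 2]
# 100 [1 0 1]
```

As in the real data, raising the threshold leaves fewer oligos per replicate, so correlations at very high thresholds rest on very few data points.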