<a target="_blank" href="https://colab.research.google.com/github/kircherlab/MPRAlib/examples/mpralib.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [36]:
!pip --quiet install MPRAlib

# Loading Data and Understanding the MPRAlib data structure

Here we load a barcode count file (the output of [MPRAsnakeflow](https://github.com/kircherlab/MPRAsnakeflow)) into MPRAlib, more precisely into an `MPRAdata` object from `mpralib.mpradata`. MPRAlib builds an [AnnData](https://anndata.readthedocs.io) object from it, which is accessible via `MPRAdata.data`:

<img src="https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg" alt="AnnData" width="480 "/>

## Core data

- `var`: barcodes
- `obs`: replicates

Oligo names are also stored in `var`, e.g. `MPRAdata.data.var["oligos"]`. DNA and RNA counts are stored in `layers`, as `MPRAdata.data.layers["dna"]` and `MPRAdata.data.layers["rna"]`. These are the raw counts, and these layers are never modified.

Metadata is stored in the `MPRAdata.data.uns` dictionary.

## Extended count data

Some further layers are updated dynamically by the library. These are:

1. Normalization layers in `MPRAdata.data.layers["dna_normalized"]` and `MPRAdata.data.layers["rna_normalized"]`. Usually we have to normalize the data to account for different sequencing depths. This is done by dividing the counts by the sum of all counts per replicate and scaling the result (usually by `1e6`, as in counts per million).
2. Sampling layers in `MPRAdata.data.layers["dna_sampling"]` and `MPRAdata.data.layers["rna_sampling"]`. These are only needed in the edge case where you want to downsample your counts.

The library usually generates these layers and takes care of them. For example, when barcode filtering is applied, the normalization layers are updated accordingly.
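
The normalization described above can be sketched in plain NumPy. This is a minimal illustration on toy counts, not MPRAlib's actual implementation: the function name `normalize_counts` and its `scaling` parameter are hypothetical, and the library may treat details such as pseudocounts differently.

```python
import numpy as np

# Toy counts with shape (n_replicates, n_barcodes).
raw_counts = np.array([
    [1, 0, 0, 1, 0],
    [0, 3, 1, 1, 0],
    [0, 1, 1, 1, 1],
], dtype=float)


def normalize_counts(counts: np.ndarray, scaling: float = 1e6) -> np.ndarray:
    """Scale each replicate (row) so that its counts sum to `scaling`."""
    totals = counts.sum(axis=1, keepdims=True)  # sequencing depth per replicate
    return counts / totals * scaling


normalized = normalize_counts(raw_counts)
print(normalized.sum(axis=1))  # every replicate now sums to the scaling factor
```

After this step, replicates with different sequencing depths are on the same scale and can be compared directly.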

MPRAlib provides getters (Python properties) to access the data intuitively. For example, `MPRAdata.raw_dna_counts` returns the raw DNA counts, and `MPRAdata.raw_rna_counts` does the same for RNA. The properties `MPRAdata.rna_counts` and `MPRAdata.dna_counts` return the latest counts. These can be the raw counts (if nothing was done), filtered counts, where some barcode counts are set to zero according to the barcode filter mask (see below), sampled counts, or sampled and filtered counts. To get normalized counts, use `MPRAdata.normalized_rna_counts` or `MPRAdata.normalized_dna_counts`. They return the layer `"rna_normalized"` or `"dna_normalized"`, which is a normalization of `MPRAdata.rna_counts` or `MPRAdata.dna_counts`.

## Barcode filtering

Barcode filters are stored as an n_barcodes x n_replicates matrix within the AnnData object, in `MPRAdata.data.varm["barcode_filter"]`. You can set a new filter using the setter `MPRAdata.barcode_filter = new_filter`. The filter is a boolean matrix with `True` for barcodes that should be removed and `False` for barcodes that should be kept. The setter also updates the normalized counts automatically.
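
To make the mask semantics concrete, here is a small NumPy sketch on toy data. It only illustrates the idea of setting flagged counts to zero and does not call MPRAlib. The arrays are made up, and the transpose reflects the shapes described above (counts as replicates x barcodes, filter as barcodes x replicates).

```python
import numpy as np

# Toy counts, shape (n_replicates, n_barcodes) = (2, 4).
counts = np.array([
    [5, 0, 12, 3],
    [7, 2, 40, 1],
])

# Toy filter mask, shape (n_barcodes, n_replicates); True marks a barcode to remove.
barcode_filter = np.array([
    [False, False],
    [False, True],
    [True, True],
    [False, False],
])

# Transpose the mask so it lines up with the counts, then zero out flagged entries.
filtered = np.where(barcode_filter.T, 0, counts)
print(filtered)
```

Because the mask has one column per replicate, a barcode can be removed in one replicate and kept in another.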

There are pre-implemented methods to filter barcodes, such as outlier detection. We will cover this part later.

## Grouping Data by Oligo

Usually, when we work with MPRA data, we are not interested in the individual barcode counts but in counts aggregated per oligo. This is done by grouping the data by oligo and writing a new AnnData object into our library, called `MPRAdata.grouped_data`. It is a new AnnData object because the data structure is slightly different: `var` now holds the oligos, while `obs` still holds the replicates. The layers `"dna"`, `"rna"`, `"dna_normalized"`, and `"rna_normalized"` also exist for the grouped data but are now aggregated per oligo. The grouped data usually uses the filtered or sampled counts (if applicable). It also uses a barcode threshold to filter out oligos that do not have enough barcodes. This threshold can be set with, for example, `MPRAdata.barcode_threshold = 10`. The default is 1 (no oligo is removed).

We have two additional layers:

1. `MPRAdata.grouped_data.layers["log2FoldChange"]` stores the log2 RNA/DNA ratio.
2. `MPRAdata.grouped_data.layers["barcodes"]` stores the number of barcodes per oligo that were used for the aggregation.
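
The grouping step can be sketched with NumPy on toy data. This is an illustration under stated assumptions, not the library's code: the function `group_by_oligo` and the oligo names are hypothetical, a barcode is assumed to support an oligo when it has any DNA or RNA count, and counts are assumed to be normalized per replicate before they are summed per oligo.

```python
import numpy as np

# Toy data: 2 replicates x 6 barcodes, assigned to 3 hypothetical oligos.
oligos = np.array(["oligoA", "oligoA", "oligoA", "oligoB", "oligoB", "oligoC"])
dna = np.array([[4, 0, 6, 3, 5, 2],
                [2, 3, 5, 0, 4, 0]], dtype=float)
rna = np.array([[8, 0, 9, 3, 2, 1],
                [4, 6, 7, 0, 5, 0]], dtype=float)


def group_by_oligo(dna, rna, oligos, barcode_threshold=1, scaling=1e6):
    names, idx = np.unique(oligos, return_inverse=True)
    # One-hot membership matrix, shape (n_barcodes, n_oligos).
    membership = (idx[:, None] == np.arange(len(names))).astype(float)

    # Assumption: a barcode supports an oligo if it has any DNA or RNA count.
    observed = ((dna + rna) > 0).astype(float)
    n_bc = observed @ membership

    # Assumption: normalize per replicate first, then sum per oligo.
    dna_norm = (dna / dna.sum(axis=1, keepdims=True) * scaling) @ membership
    rna_norm = (rna / rna.sum(axis=1, keepdims=True) * scaling) @ membership

    # Oligos below the barcode threshold get no barcodes and no activity.
    keep = n_bc >= barcode_threshold
    with np.errstate(divide="ignore", invalid="ignore"):
        log2fc = np.where(keep, np.log2(rna_norm / dna_norm), np.nan)
    n_bc = np.where(keep, n_bc, 0)
    return names, n_bc, log2fc


names, n_bc, log2fc = group_by_oligo(dna, rna, oligos, barcode_threshold=2)
print(names)
print(n_bc)
print(np.round(log2fc, 3))
```

Raising `barcode_threshold` removes oligos replicate by replicate, which is why the number of available oligos can differ between replicates.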

## Example

Let's start with loading a barcode count file from [MPRAsnakeflow](https://github.com/kircherlab/MPRAsnakeflow):

In [37]:
# Loading the MPRAlib library
from mpralib.mpradata import MPRAdata
# Loading other libraries
import pandas as pd
import numpy as np

# Load the data
mpradata = MPRAdata.from_file("../resources/reporter_experiment.barcode.HEPG2.fromFile.default.all.tsv.gz")

# Getting counts, no filtering/sampling done, so raw counts
print("DNA counts")
display(pd.DataFrame(mpradata.dna_counts[:,0:5], index=mpradata.replicates, columns=mpradata.barcodes[0:5]))
print("RNA counts")
display(pd.DataFrame(mpradata.rna_counts[:,0:5], index=mpradata.replicates, columns=mpradata.barcodes[0:5]))


DNA counts


barcode,TACTCTCCGTGCCCA,GGTATAACATCTCCG,TTAGGAGTCACACGT,GAATATAACACCCGA,AAACACCGCGCTCTA
1,1,0,0,1,0
2,0,3,1,1,0
3,0,1,1,1,1


RNA counts


barcode,TACTCTCCGTGCCCA,GGTATAACATCTCCG,TTAGGAGTCACACGT,GAATATAACACCCGA,AAACACCGCGCTCTA
1,1,0,0,1,0
2,0,1,1,1,0
3,0,2,2,3,1


Generating correlations across replicates at different barcode thresholds:

In [38]:
print("Pairwise Pearson correlation")
print(pd.DataFrame(mpradata.pearson_correlation, index=mpradata.replicates, columns=mpradata.replicates).round(3))
print("Parwise Spearman correlation")
print(pd.DataFrame(mpradata.spearman_correlation, index=mpradata.replicates, columns=mpradata.replicates).round(3))


# Setting a different barcode threshold 1, 10, 50 and getting the average across replicates
for threshold in [1, 10, 50]:
    mpradata.barcode_threshold = threshold
    print(f"Mean Pearson correlation activity, BC threshold {mpradata.barcode_threshold}: {mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3)}")
    
    print(f"Mean Pearson correlation RNA counts (normalized), BC threshold {mpradata.barcode_threshold}: {mpradata.pearson_correlation_rna.flatten()[[1,2,5]].mean().round(3)}")
    
    print(f"Mean Pearson correlation DNA counts (normalized), BC threshold {mpradata.barcode_threshold}: {mpradata.pearson_correlation_dna.flatten()[[1,2,5]].mean().round(3)}")

    print("")

Pairwise Pearson correlation


       1      2      3
1  1.000  0.456  0.469
2  0.456  1.000  0.481
3  0.469  0.481  1.000
Parwise Spearman correlation
       1      2      3
1  1.000  0.361  0.379
2  0.361  1.000  0.356
3  0.379  0.356  1.000
Mean Pearson correlation activity, BC threshold 1: 0.469
Mean Pearson correlation RNA counts (normalized), BC threshold 1: -0.142
Mean Pearson correlation DNA counts (normalized), BC threshold 1: -0.14

Mean Pearson correlation activity, BC threshold 10: 0.655
Mean Pearson correlation RNA counts (normalized), BC threshold 10: 0.037
Mean Pearson correlation DNA counts (normalized), BC threshold 10: -0.135

Mean Pearson correlation activity, BC threshold 50: 0.839
Mean Pearson correlation RNA counts (normalized), BC threshold 50: 0.739
Mean Pearson correlation DNA counts (normalized), BC threshold 50: 0.195
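
The index list `[1, 2, 5]` used above picks the three entries above the diagonal of the flattened 3 x 3 correlation matrix, i.e. the replicate pairs (1, 2), (1, 3), and (2, 3). The following sketch, which reuses the Pearson values printed above, shows that selection next to a variant based on `np.triu_indices` that works for any number of replicates. The variable names are illustrative only.

```python
import numpy as np

# Pairwise Pearson correlations of the three replicates (from the output above).
corr = np.array([
    [1.000, 0.456, 0.469],
    [0.456, 1.000, 0.481],
    [0.469, 0.481, 1.000],
])

# Hard-coded positions of the upper triangle in the flattened 3 x 3 matrix.
mean_fixed = corr.flatten()[[1, 2, 5]].mean()

# General form: k=1 skips the diagonal, so each replicate pair is used once.
rows, cols = np.triu_indices(corr.shape[0], k=1)
mean_general = corr[rows, cols].mean()

print(round(mean_fixed, 3), round(mean_general, 3))
```

Both expressions average the same three values, so they agree for three replicates. Only the second one stays correct when replicates are added.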



Using such a strict barcode threshold also reduces the number of oligos available per replicate:

In [39]:
for threshold in [1, 10, 50]:
    mpradata.barcode_threshold = threshold
    print(f"Number of oligos per individual replicate, using BC threshold {mpradata.barcode_threshold}")
    print(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))


Number of oligos per individual replicate, using BC threshold 1
[2243 2250 2253]
Number of oligos per individual replicate, using BC threshold 10
[1359 1349 1357]
Number of oligos per individual replicate, using BC threshold 50
[148 146 154]


## Barcode filtering and outlier detection

There are some pre-built functions to remove barcodes from experiments. This is done via the function `MPRAdata.apply_barcode_filter()`, which offers pre-built filters such as a minimum or maximum threshold for counts. For example, you may want to remove barcodes with extremely high counts, or 'trust' only barcodes with at least 3 RNA counts to remove noisy data. The function can also randomly remove barcodes if you want to downsample your data on the barcode level (which corresponds to removing barcodes from your assignment, and therefore to having a lower-quality/lower-depth assignment file).

It can also detect outlier barcodes, e.g. if you want to remove barcodes that do not follow the count distribution of their oligo. This is usually done per replicate.
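
As a rough illustration of per-oligo outlier detection, the sketch below flags barcodes whose count lies more than a given number of standard deviations away from the mean of their oligo, within a single replicate. It is written under assumptions: the function `zscore_outliers` and the toy data are hypothetical, and the quantity MPRAlib actually scores (raw counts, normalized counts, or a ratio) may differ.

```python
import numpy as np

# Toy RNA counts for one replicate; each barcode belongs to a hypothetical oligo.
oligos = np.array(["oligoA"] * 12 + ["oligoB"] * 3)
rna = np.array([3, 4, 5, 4, 3, 5, 4, 4, 3, 5, 4, 60, 2, 3, 2], dtype=float)


def zscore_outliers(counts, groups, times_zscore=3.0):
    """Return a boolean mask that is True for barcodes flagged as outliers."""
    mask = np.zeros(counts.shape, dtype=bool)
    for group in np.unique(groups):
        members = groups == group
        values = counts[members]
        std = values.std()  # population standard deviation (ddof=0)
        if std == 0:
            continue  # all barcodes identical, nothing to flag
        mask[members] = np.abs(values - values.mean()) / std > times_zscore
    return mask


mask = zscore_outliers(rna, oligos, times_zscore=3.0)
print(np.flatnonzero(mask))  # positions of the flagged barcodes
```

With the population standard deviation, a single value in a group of n barcodes can have a z-score of at most sqrt(n - 1), so a cutoff of 3 can only flag barcodes of oligos with at least 11 barcodes. This matters for low-count datasets.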

Now we try to remove barcodes that are outliers within the RNA counts, using the z-score:

In [40]:
from mpralib.mpradata import BarcodeFilter

# using standard number of barcodes per oligo
mpradata.barcode_threshold = 10

# Resetting the barcode filter
mpradata.barcode_filter = None
print(f"Mean Pearson correlation before BC filter, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Number of oligs")
print(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))

# Apply filter
mpradata.apply_barcode_filter(BarcodeFilter.RNA_ZSCORE, {"times_zscore": 3})

print(f"Mean Pearson correlation after BC filter, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Number of oligs")
print(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))

Mean Pearson correlation before BC filter, BC threshold 10
0.655
Number of oligs
[1359 1349 1357]
Mean Pearson correlation after BC filter, BC threshold 10
0.594
Number of oligs
[1359 1349 1357]


As you can see, the number of oligos stays the same but our Pearson correlation drops. Maybe this was not a good idea for this example dataset. Still, we want to see which barcodes were removed. We can access them using the `MPRAdata.barcode_filter` matrix together with the `MPRAdata.data.var` object.

In [None]:
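# TODO: show the barcodes that were removed by the filter.
# Placeholder example: iterate over the first axis (rows) of a NumPy array.
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

def process_row(row):
    # Perform some operation on each row (here: print it)
    print(row)

# A plain loop visits each row once and does not collect the None return values
for row in array:
    process_row(row)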
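
Until that cell is filled in, the idea can be sketched on toy data. The mask below is made up and has the n_barcodes x n_replicates layout described earlier; the barcode names are the first four from the example file. With real data, the mask would presumably come from `mpradata.barcode_filter` and the names from `mpradata.barcodes`, but the exact types returned by the library are an assumption here.

```python
import numpy as np

# First four barcodes of the example file and a toy filter mask (True = removed).
barcodes = np.array([
    "TACTCTCCGTGCCCA", "GGTATAACATCTCCG", "TTAGGAGTCACACGT", "GAATATAACACCCGA",
])
barcode_filter = np.array([
    [False, False, False],
    [True, False, False],
    [False, False, False],
    [True, True, True],
])

# Barcodes removed in at least one replicate.
removed_any = barcodes[barcode_filter.any(axis=1)]
# Barcodes removed in every replicate.
removed_all = barcodes[barcode_filter.all(axis=1)]
# Number of removed barcodes per replicate.
removed_per_replicate = barcode_filter.sum(axis=0)

print(removed_any)
print(removed_all)
print(removed_per_replicate)
```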

Let's see whether we get a different result when we keep only barcodes with higher DNA and RNA counts:

In [41]:
mpradata.barcode_filter = None
print(f"Mean Pearson correlation before BC filter, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Number of oligs")
print(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))

# Apply filter
mpradata.apply_barcode_filter(BarcodeFilter.MIN_COUNT, {"dna_min_count": 2, "rna_min_count": 4})

print(f"Mean Pearson correlation after BC filter, BC threshold {mpradata.barcode_threshold}")
print(mpradata.pearson_correlation.flatten()[[1,2,5]].mean().round(3))
print(f"Number of oligs")
print(np.sum(mpradata.grouped_data.layers["barcodes"]!=0,axis=1))

Mean Pearson correlation before BC filter, BC threshold 10
0.655
Number of oligs
[1359 1349 1357]
Mean Pearson correlation after BC filter, BC threshold 10
0.78
Number of oligs
[12  5  8]


Now we see an improved correlation. But hold on... Almost no oligo is left that fulfills the barcode threshold of 10. This example dataset seems to have very low counts, so this filter was maybe not the best idea.

## Saving activity files

Finally, we want to save our oligo activities. There is a pre-built function for that.

In [42]:
from mpralib.utils import export_activity_file, export_barcode_file
export_activity_file(mpradata, "activity.tsv")

df = pd.read_csv("activity.tsv", sep='\t')
display(df)

Unnamed: 0,replicate,oligo_name,dna_counts,rna_counts,dna_normalized,rna_normalized,log2FoldChange,n_bc
0,1,A:HNF4A-ChMod_chr5:78281698-78281840__chr5:782...,71,154,839.8032,995.9447,0.246,27
1,1,A:HNF4A-NoMod_chr17:27505275-27505406__chr17:2...,24,62,786.6728,1069.6776,0.4433,10
2,1,A:HNF4A-NoMod_chr17:37895425-37895596__chr17:3...,49,123,861.2268,1163.7696,0.4343,18
3,1,C:SLEA_hg18:chr9:82902419-82902586|13:V_HNF6_Q...,40,100,848.3727,1139.0086,0.425,15
4,1,R:EP300-ChMod_chr19:3397595-3397766__chr19:339...,36,77,826.337,965.6812,0.2248,14
5,1,R:EP300-ChMod_chr4:6784271-6784442__chr4:67842...,39,69,875.9172,880.7861,0.008,14
6,1,R:EP300-NoMod_chr3:23958571-23958742__chr3:239...,27,47,799.2933,783.3498,-0.0291,11
7,1,R:FOXA1-ChMod_chr6:137674036-137674152__chr6:1...,27,48,856.0851,861.6847,0.0094,10
8,1,R:FOXA1_FOXA2-ChMod_chr5:74172389-74172557__ch...,61,89,1113.4891,974.9666,-0.1917,16
9,1,R:FOXA1_FOXA2-ChMod_chr6:11393329-11393497__ch...,24,44,786.6728,802.2582,0.0283,10


So every oligo per replicate is written to one row with its counts, normalized counts, log2 fold change, and the number of supporting barcodes. In total there are only 25 entries (12 + 5 + 8 oligos across the three replicates) because of the drastic filtering we applied before.

We can also write out the results in a barcode format:

In [43]:
export_barcode_file(mpradata, "barcodes.tsv")

df = pd.read_csv("barcodes.tsv", sep='\t')
display(df)

Unnamed: 0,barcode,oligo_name,dna_count_1,rna_count_1,dna_count_2,rna_count_2,dna_count_3,rna_count_3
0,TACTCTCCGTGCCCA,A:HNF4A-ChMod_chr10:11917871-11917984__chr10:1...,,,,,,
1,GGTATAACATCTCCG,A:HNF4A-ChMod_chr10:11917871-11917984__chr10:1...,,,,,,
2,TTAGGAGTCACACGT,A:HNF4A-ChMod_chr10:11917871-11917984__chr10:1...,,,,,,
3,GAATATAACACCCGA,A:HNF4A-ChMod_chr10:11917871-11917984__chr10:1...,,,,,,
4,AAACACCGCGCTCTA,A:HNF4A-ChMod_chr10:11917871-11917984__chr10:1...,,,,,,
...,...,...,...,...,...,...,...,...
92724,ACCCTATCAACGGCT,R:HNF4A-NoMod_chrY:18213828-18213963__chrY:182...,,,,,,
92725,AAGCGCTCGGCGATT,R:HNF4A-NoMod_chrY:18213828-18213963__chrY:182...,,,,,,
92726,CCCGACGGTCCCGGC,R:HNF4A-NoMod_chrY:18213828-18213963__chrY:182...,,,,,,
92727,CCAATCATCGAACTT,R:HNF4A-NoMod_chrY:18213828-18213963__chrY:182...,,,,,,


This dumps all DNA and RNA counts into a file, in the same format as the data we loaded, but now with the filtering applied. That is why we see a lot of missing values.
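
To quantify how much was filtered, the missing values can be counted with pandas. The sketch below runs on a small made-up table that mimics the column layout of the exported file (without the index column), so the barcodes, oligo names, and counts are purely illustrative.

```python
import io

import pandas as pd

# Toy table in the layout of the exported barcode file (values are made up).
tsv = (
    "barcode\toligo_name\tdna_count_1\trna_count_1\tdna_count_2\trna_count_2\n"
    "AAAC\toligoA\t2\t5\t\t\n"
    "CCGT\toligoA\t\t\t\t\n"
    "GGTA\toligoB\t3\t4\t2\t6\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

count_columns = [c for c in df.columns if c.startswith(("dna_count", "rna_count"))]

# Missing entries per count column.
missing_per_column = df[count_columns].isna().sum()
# Barcodes without any count left in any replicate.
fully_filtered = df[count_columns].isna().all(axis=1)

print(missing_per_column)
print(df.loc[fully_filtered, "barcode"].tolist())
```

Applied to the real export, the same two expressions would show how many barcodes each replicate lost and which barcodes were filtered out everywhere.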