# Tutorial: shRNA guide analysis

In this small case study, we use _rnalib_ to better understand the potential impact of a set of shRNA guides that were predicted by an external tool.
Briefly, we will do the following:
    
- create a subset of the human transcriptome and load the gene sequences
- create a random list of shRNA guide sequences (in a real scenario those would be predicted by an external tool like SplashRNA)
- create a pandas dataframe containing 
    - the guide sequence 
    - a list of transcript ids that contain this guide (exact match) in their spliced RNA sequence
    - a list of gene ids for these transcripts
- filter out guides that are not found or that target multiple genes
- check for untargeted transcripts of the targeted genes

Please note that the code in this case study is not optimized and is more explicit than necessary in order to showcase the API and make the example easier to understand. 

## Requirements
Before executing this notebook, you need to install *rnalib* and its dependencies as well as the optional libraries used by this notebook.
We recommend doing this in a [Python virtual environment](https://rnalib.readthedocs.io/en/latest/readme.html#installation).

Note that this notebook as well as *rnalib*'s test suite use various test resources (genomics data files and indexing structures). There are [different ways](https://rnalib.readthedocs.io/en/latest/readme.html#test-data) 
to get these files: you can either download them from GitHub or, if you have `bedtools`, `bgzip` and `tabix` installed, create them automatically by running the `rnalib create_testdata` command-line script or by calling `testdata.create_testdata()` as done below. Refer to `testdata.py` for details on how the test data files were created. There are two separate test resource sets:

* *test_resources*: small test datasets that are used by *rnalib*'s test suite
* *large_test_resources*: larger test files that are needed to demonstrate *rnalib* under realistic conditions. 

*Rnalib* knows about the location of the test data via the package-global `__RNALIB_TESTDATA__` variable. This variable can be set either via the "RNALIB_TESTDATA" environment variable or by 'monkeypatching' (`rna.__RNALIB_TESTDATA__ = ...`) as shown below. Once this is done, test data resources can be accessed via `rna.get_resource(<resource_id>)`. *Rnalib* will recreate these resources only if they are not found in the configured output folder.
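The environment-variable route can be sketched as follows. This is a minimal illustration that only uses the standard library; `resolve_testdata_dir` is a hypothetical helper (not part of *rnalib*) that prefers the "RNALIB_TESTDATA" environment variable and otherwise falls back to a default folder.

```python
import os

def resolve_testdata_dir(default="rnalib_testdata/"):
    """Hypothetical helper: return the test data directory as a string.

    Uses the RNALIB_TESTDATA environment variable if it is set and
    non-empty, otherwise falls back to `default`.
    """
    return os.environ.get("RNALIB_TESTDATA") or default

testdata_dir = resolve_testdata_dir()
print(f"Test data directory: {testdata_dir}")
```

The returned value could then be assigned to `rna.__RNALIB_TESTDATA__`, which is what the next cell does with a hard-coded folder name.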

In [2]:
import os, pathlib, platform
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import traceback
import math
import random

# load rnalib
import rnalib as rna
from rnalib import gi, GI, SEP, display_textarea
display(f"Running rnalib {rna.__version__} on python {platform.python_version()}")

# ensure test data
rna.__RNALIB_TESTDATA__ = "rnalib_testdata/" # monkeypatch test data dir
rna.testdata.create_testdata(rna.__RNALIB_TESTDATA__, (rna.testdata.test_resources,rna.testdata.large_test_resources)) # requires additional tools installed
display(f"Testdata in {rna.__RNALIB_TESTDATA__}")
display_textarea('\n'.join(rna.dir_tree(pathlib.Path(rna.__RNALIB_TESTDATA__))))

'Running rnalib 0.0.3 on python 3.12.1'

'Testdata in rnalib_testdata/'

## Creation of shRNA guide sequences
First, we build a subset of the human transcriptome (chr20).

In [3]:
t=rna.Transcriptome(
    annotation_gff = rna.get_resource("full_gencode_gff"),
    annotation_flavour='gencode',
    genome_fa = rna.get_resource("grch38_chr20"),
    copied_fields=['gene_type'],
    feature_filter = rna.TranscriptFilter().include_chromosomes(['chr20']),
    load_sequence_data=True
)
display(t)

Transcriptome with 1480 genes and 5822 tx (+seq)

Now, we create a random set of 20 shRNA guides of length 10. In a real scenario, those would be predicted by some external tool such as [SplashRNA](http://splashrna.mskcc.org/).
We use `rnd_seq()`, an *rnalib* utility function, to create random guide sequences with a defined (low) GC%.

In [4]:
random.seed(0) # if you change this, different random sequences will be created
guides=rna.rnd_seq(10, 'GC'* 30 + 'AT' * 70, 20) # create 20 random sequences of length 10 with an expected GC content of 30%
print(guides)

['AATGAAATTA', 'TTCTCACTGA', 'AAAACTCAGT', 'AATCATAATG', 'TATTACACCA', 'ATTAGAATAA', 'TAGGTGTGTA', 'CGTATCTTAA', 'CATAGAAATT', 'CAATTTACGC', 'GAAATCGTTC', 'GCGTAAAAAA', 'TACTATAATA', 'TATTCTACAA', 'TTGAGTTCAG', 'TCTTCCCAGC', 'AAAAAGGCAG', 'TTGACACCTC', 'CCTCACACTC', 'TAACACGGTT']
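To illustrate how a weighted alphabet translates into an expected GC content, here is a small standalone sketch. It does not reproduce the implementation of `rnd_seq()` (and will not generate the same sequences); `random_seqs` and `gc_fraction` are hypothetical helpers that assume each base is drawn independently and uniformly from the alphabet string.

```python
import random

def random_seqs(length, alphabet, n, seed=None):
    """Draw n sequences; each base is picked uniformly from `alphabet`."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]

def gc_fraction(seq):
    """Fraction of G and C characters in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

# 60 G/C characters and 140 A/T characters: 30% of all draws are G or C
alphabet = "GC" * 30 + "AT" * 70
seqs = random_seqs(10, alphabet, 1000, seed=0)
mean_gc = sum(gc_fraction(s) for s in seqs) / len(seqs)

print(f"GC fraction of the alphabet: {gc_fraction(alphabet):.2f}")
print(f"Mean GC fraction of {len(seqs)} random 10-mers: {mean_gc:.2f}")
```

Individual 10-mers will deviate considerably from 30% GC; only the average over many sequences approaches the composition of the alphabet.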


## Calculate target transcripts
Now we search the transcriptome for transcripts that contain the respective guide k-mers in their spliced RNA sequence. 
To make this fast, we first search for the k-mer in the gene sequences (`candidate_genes`) and then check all spliced tx sequences of those genes only (`overlapping_tx`).
Finally, we are interested in whether some guides bind RNAs from multiple genes, so we create a set of gene names for the targeted transcripts (`overlapping_genes`).
We combine the results in a pandas dataframe.
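The two-stage search described above can be sketched on toy data without *rnalib*. Everything in this block is hypothetical: the `GENES` dictionary, the exon coordinates (0-based, end-exclusive) and the helper functions are illustrative stand-ins, and gene sequences are assumed to be given on the forward strand.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spliced(gene_seq, exons):
    """Concatenate the exon slices (0-based, end-exclusive) of a gene sequence."""
    return "".join(gene_seq[start:end] for start, end in exons)

# Hypothetical toy annotation: gene sequence plus exon coordinates per transcript
GENES = {
    "geneA": {
        "sequence": "AAACCCGGGTTTACGTACGT",
        "transcripts": {"txA1": [(0, 6), (12, 20)], "txA2": [(0, 9)]},
    },
    "geneB": {
        "sequence": "TTTTGGGGAAAATTTT",
        "transcripts": {"txB1": [(0, 16)]},
    },
}

def find_targets(guide, genes):
    """Return {gene: [transcript ids]} whose spliced sequence contains the guide."""
    rc = reverse_complement(guide)
    # Stage 1: cheap prefilter on the (unspliced) gene sequences
    candidates = [name for name, rec in genes.items()
                  if guide in rec["sequence"] or rc in rec["sequence"]]
    # Stage 2: check the spliced sequences of candidate genes only
    hits = {}
    for name in candidates:
        rec = genes[name]
        tids = sorted(tid for tid, exons in rec["transcripts"].items()
                      if guide in spliced(rec["sequence"], exons))
        if tids:
            hits[name] = tids
    return hits

print(find_targets("CCGGG", GENES))   # found in the gene sequence and in txA2
print(find_targets("CCCACG", GENES))  # spans the exon junction of txA1
```

In this toy setup, the second guide occurs only across the exon-exon junction of `txA1`, so it is absent from the unspliced gene sequence and the prefilter does not report it. Whether such junction-spanning matches matter depends on the guides being analyzed.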

In [5]:
d=[]
for guide in tqdm(guides, total=len(guides), desc='analyzing guide'):
    candidate_genes={g for g in t.genes if guide in g.sequence} # get genes whose sequence contains the guide sequence
    candidate_genes|={g for g in t.genes if rna.reverse_complement(guide) in g.sequence} # add genes whose sequence contains its reverse complement
    overlapping_tx={tx for g in candidate_genes for tx in g.transcript if guide in tx.spliced_sequence}
    overlapping_genes={tx.parent.gene_name for tx in overlapping_tx}
    d.append(
        {
            'guide_seq': guide,
            'genes': ','.join(overlapping_genes),
            'n_gids': len(overlapping_genes),
            'tids': ','.join([tx.feature_id for tx in overlapping_tx])
        }
    )
df=pd.DataFrame(d)
display(df.head(8))

| | guide_seq | genes | n_gids | tids |
|---|---|---|---|---|
| 0 | AATGAAATTA | SNAP25-AS1,ENSG00000289072,PLCB4,DYNLRB1,ESF1,... | 9 | ENST00000662559.1,ENST00000609507.1,ENST000006... |
| 1 | TTCTCACTGA | SNPH,CTSZ,MKKS,MIR646HG | 4 | ENST00000381867.6,ENST00000652676.1,ENST000006... |
| 2 | AAAACTCAGT | LINC00261,SYCP2,CFAP61,STK4 | 4 | ENST00000451767.6,ENST00000377306.5,ENST000002... |
| 3 | AATCATAATG | SPO11 | 1 | ENST00000371263.8,ENST00000345868.8,ENST000004... |
| 4 | TATTACACCA | SERINC3,XRN2 | 2 | ENST00000255175.5,ENST00000342374.5,ENST000003... |
| 5 | ATTAGAATAA | ADNP | 1 | ENST00000371602.9 |
| 6 | TAGGTGTGTA | ENSG00000228293 | 1 | ENST00000418739.1 |
| 7 | CGTATCTTAA | | 0 | |


In the dataframe above, we can see that some guides (e.g., CGTATCTTAA) are not found (`n_gids` is 0) while others target multiple 
genes (e.g., AATGAAATTA). This is expected here, as we generate short (10 bp) sequences with low GC% and low sequence complexity
that are likely to be found in many genomic locations. In a real scenario, one would expect few such cases for guides predicted by state-of-the-art tools.
We continue the analysis by filtering out these unsuitable guides and keep only guides that target exactly one gene.
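As a hedged aside, the three outcomes described above (not found, unique, multi-gene) can be tallied before filtering. The sketch below uses plain Python and the `n_gids` values of the eight rows displayed above; `classify` is a hypothetical helper, and the same logic could be applied to the full `df['n_gids']` column.

```python
from collections import Counter

def classify(n_genes):
    """Label a guide by the number of genes it targets."""
    if n_genes == 0:
        return "not_found"
    if n_genes == 1:
        return "unique"
    return "multi_gene"

# n_gids values of the first eight guides shown in the dataframe above
n_gids = [9, 4, 4, 1, 2, 1, 1, 0]
tally = Counter(classify(n) for n in n_gids)
print(dict(tally))
```

Only guides in the "unique" category pass a filter of the form `n_gids == 1`.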

In [6]:
fil=df[df['n_gids']==1].copy() 
display(fil)

| | guide_seq | genes | n_gids | tids |
|---|---|---|---|---|
| 3 | AATCATAATG | SPO11 | 1 | ENST00000371263.8,ENST00000345868.8,ENST000004... |
| 5 | ATTAGAATAA | ADNP | 1 | ENST00000371602.9 |
| 6 | TAGGTGTGTA | ENSG00000228293 | 1 | ENST00000418739.1 |
| 10 | GAAATCGTTC | C20orf144 | 1 | ENST00000375222.4 |
| 13 | TATTCTACAA | APCDD1L-DT | 1 | ENST00000447767.1,ENST00000448374.5 |
| 14 | TTGAGTTCAG | MACROD2 | 1 | ENST00000490428.5 |


Now we check for the filtered guides whether all transcripts of the respective genes are targeted.
For this, we query the transcriptome for sets of transcript ids annotated for a given gene and use the *rnalib* `cmp_sets()` method to get shared and unique items when comparing to the set of transcript ids we found to be targeted by each guide.
Finally, we add the number of missed (untargeted) transcripts and the respective gene names to the dataframe.
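To make the set comparison concrete, here is a minimal standalone sketch of a three-way comparison. It assumes, based on how the result is unpacked in this tutorial, that the comparison yields (shared items, items only in the first set, items only in the second set); `compare_sets` is a hypothetical stand-in and not the implementation of `cmp_sets()`. The transcript ids are placeholders.

```python
def compare_sets(a, b):
    """Return (shared, only_in_a, only_in_b) for two collections."""
    a, b = set(a), set(b)
    return a & b, a - b, b - a

annotated = {"tx1", "tx2", "tx3"}  # all transcripts annotated for a gene (toy ids)
targeted = {"tx1", "tx3"}          # transcripts that contain the guide (toy ids)

shared, missed, unexpected = compare_sets(annotated, targeted)
print(f"{len(missed)}/{len(annotated)} transcripts missed: {sorted(missed)}")
```

Since the targeted transcripts were found by searching the annotated transcripts of the gene, the third set is expected to be empty.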

In [7]:
missed_tx, gene_names=[],[]
for guide, gid, tids in zip(fil['guide_seq'], fil['genes'], fil['tids']):
    all_tid={tx.feature_id for tx in t.gene[gid].transcript}
    found_tids=set(tids.split(','))
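    # cmp_sets is an rnalib utility function for set comparison; as used here:
    # s = shared ids, m = missed ids (annotated but not targeted), w = ids that were targeted but are not annotated for this gene
    s,m,w=rna.cmp_sets(all_tid, found_tids)
    missed_tx.append(f"{len(m)}/{len(all_tid)}")
    gene_names.append(t.gene[gid].gene_name)
    assert len(w)==0, "We should not find a targeted tx that is not annotated for this gene"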
fil['missed_tx']=missed_tx
fil['gene_name']=gene_names
display(fil)

| | guide_seq | genes | n_gids | tids | missed_tx | gene_name |
|---|---|---|---|---|---|---|
| 3 | AATCATAATG | SPO11 | 1 | ENST00000371263.8,ENST00000345868.8,ENST000004... | 1/5 | SPO11 |
| 5 | ATTAGAATAA | ADNP | 1 | ENST00000371602.9 | 8/9 | ADNP |
| 6 | TAGGTGTGTA | ENSG00000228293 | 1 | ENST00000418739.1 | 0/1 | ENSG00000228293 |
| 10 | GAAATCGTTC | C20orf144 | 1 | ENST00000375222.4 | 1/2 | C20orf144 |
| 13 | TATTCTACAA | APCDD1L-DT | 1 | ENST00000447767.1,ENST00000448374.5 | 7/9 | APCDD1L-DT |
| 14 | TTGAGTTCAG | MACROD2 | 1 | ENST00000490428.5 | 15/16 | MACROD2 |


We can see that, except for ENSG00000228293 (where we miss 0 of 1 transcripts), none of the guides targets all annotated transcripts of its gene.
In a real scenario, we would probably restrict this check to transcripts that are actually expressed in the respective
cells (cf. our expression analysis tutorial).
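Such a restriction to expressed transcripts could look roughly like the sketch below. All names and numbers are hypothetical: `missed_expressed` is an illustrative helper, the transcript ids are placeholders, the TPM values are made up, and the threshold of 1.0 TPM is an arbitrary assumption.

```python
def missed_expressed(annotated, targeted, tpm, min_tpm=1.0):
    """Return (missed, expressed): expressed transcripts not hit by the guide,
    and all transcripts with expression >= min_tpm."""
    expressed = {tid for tid in annotated if tpm.get(tid, 0.0) >= min_tpm}
    return expressed - set(targeted), expressed

annotated = {"tx1", "tx2", "tx3"}             # toy transcript ids of one gene
targeted = {"tx1"}                            # transcripts containing the guide
tpm = {"tx1": 12.5, "tx2": 0.2, "tx3": 3.0}   # made-up expression values

missed, expressed = missed_expressed(annotated, targeted, tpm)
print(f"{len(missed)}/{len(expressed)} expressed transcripts missed")
```

In this toy case, `tx2` is ignored because it falls below the threshold, so only one of two expressed transcripts is counted as missed.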

## Summary

This concludes our small shRNA guide analysis tutorial. This tutorial demonstrated how to:
* Generate random shRNA guide sequences with a given GC% (in a real analysis, these guides would be predicted by an external tool)
* Calculate which genes/transcripts are targeted by a given shRNA guide
* Filter for genes that are uniquely targeted by an shRNA guide and check what fraction of their transcripts is targeted