# Tutorial: shRNA guide analysis

In this small case study we will do the following:
    
- create a subset of the human transcriptome and load the gene sequences
- create a random list of shRNA guide sequences (in a real scenario those would, e.g., be predicted by some external tool)
- create a pandas dataframe containing 
    - the guide sequence 
    - a list of transcript ids that contain this guide (exact match) in their spliced RNA sequence
    - a list of gene ids for these transcripts
- filter out guides that are not found or that target multiple genes
- check for untargeted transcripts of the targeted genes

Please note that the code in this case study is not optimized and is more explicit than necessary in order to showcase the API and make the example easier to understand. 

In [1]:
import os, pathlib, platform
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import traceback
import math
import random

# load rnalib
import rnalib as rna
from rnalib import gi, SEP, display_textarea
plt.rcParams["figure.figsize"] = (20,3)

display(f"Running rnalib {rna.__version__} on python {platform.python_version()}")

'Running rnalib 1.0.0 on python 3.10.4'

## Test datasets

This notebook and rnalib's test suite use various test resources (genomics data files and indexing structures) that can be created by 
running the `rnalib_create_testdata` script or by calling the `testdata.create_testdata()` method. There are two separate resource sets:

* test_resources: small test datasets that are used by rnalib's test suite
* large_test_resources: larger test files that are needed to demonstrate rnalib under realistic conditions. 

Rnalib knows about the test data directory via the package-global `__RNALIB_TESTDATA__` variable. This variable can be set either via the `RNALIB_TESTDATA` environment variable or by monkeypatching (`rna.__RNALIB_TESTDATA__ = <mydir>`) as shown below. Once this is done, test data resources can be accessed via 
`get_resource(<resource_id>)`. Rnalib will recreate these resources only if they are not found in the provided output folder.
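
The environment-variable route mentioned above could look like the following minimal sketch. It assumes that rnalib reads `RNALIB_TESTDATA` when the package is imported, which is why the variable is set first; the directory name is simply the one used in this notebook.

```python
import os
from pathlib import Path

# Directory name taken from this notebook; any writable path should work.
testdata_dir = Path("rnalib_testdata")

# Create the directory (and any missing parents); no error if it already exists.
testdata_dir.mkdir(parents=True, exist_ok=True)

# Assumption: rnalib picks this variable up at import time,
# so it has to be set before `import rnalib` is executed.
os.environ["RNALIB_TESTDATA"] = str(testdata_dir)

print(os.environ["RNALIB_TESTDATA"])
```

Unlike `os.mkdir`, `Path.mkdir(parents=True, exist_ok=True)` does not fail if the folder is already present, so the cell can be re-run safely.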


In [2]:
rna.__RNALIB_TESTDATA__ = "rnalib_testdata/"

if not os.path.isdir(rna.__RNALIB_TESTDATA__):
    os.mkdir(rna.__RNALIB_TESTDATA__)
    display(f"Creating testdata at {rna.__RNALIB_TESTDATA__}")
    rna.testdata.create_testdata(rna.__RNALIB_TESTDATA__, 
                                 rna.testdata.large_test_resources)
else:
    display(f"Testdata at {rna.__RNALIB_TESTDATA__}")
    rna.print_dir_tree(rna.__RNALIB_TESTDATA__)

'Testdata at rnalib_testdata/'

├── bigfiles
│   ├── grch38_chr20.fa.gz.gzi
│   ├── chess3.0.1.gtf.gz
│   ├── gencode_39.gff3.gz
│   ├── hgnc_complete_set.txt
│   ├── gencode_39.gff3.gz.tbi
│   ├── grch38_chr20.fa.gz
│   ├── GRCh38.k24.umap.bedgraph.gz
│   ├── grch38_chr20.fa.gz.fai
│   ├── chess3.0.1.gtf.gz.tbi
│   └── GRCh38.k24.umap.bedgraph.gz.tbi
├── bed
│   ├── test_bed12.bed.gz.tbi
│   ├── test_bed12.bed.gz
│   ├── test.bed.gz
│   ├── test.bedgraph.gz
│   ├── pybedtools_snps.bed.gz
│   ├── test_nist.b37_chr20_100kbp_at_10mb.bed
│   ├── dmel_randomvalues.bedgraph.gz
│   ├── GRCh38.k24.umap.ACTB_ex1+2.bedgraph.gz
│   ├── dmel_randomvalues.bedgraph.gz.tbi
│   ├── test.bedgraph.gz.tbi
│   └── ...
├── div
│   └── hgnc_complete_set.head.txt.gz
├── bam
│   ├── NA12878_S1.chr20.10_10p1mb.bam
│   ├── mapt.NA12156.altex.small.bam
│   ├── NA12878_S1.chr20.10_10p1mb.bam.bai
│   ├── mapt.NA12156.altex.small.bam.bai
│   ├── rogue_read.bam.bai
│   ├── rogue_read.bam
│   ├── small.ACTB+SOX2.bam.bai
│   ├── small_example.bam
│

In [11]:
# Build subset of human transcriptome (chr20)
t=rna.Transcriptome(
    annotation_gff = rna.get_resource("full_gencode_gff"),
    annotation_flavour='gencode',
    genome_fa = rna.get_resource("grch38_chr20"),
    copied_fields=['gene_type'],
    feature_filter = rna.TranscriptFilter().include_chromosomes(['chr20']),
    load_sequence_data=True
)

display(t)

Transcriptome with 1480 genes and 5822 tx (+seq)

In [12]:
# Create a random set of shRNA guides of length 10. In a real scenario those would, e.g., be predicted by some external tool.
# Here, we use rnd_seq, a convenience rnalib method, to create 20 random guide sequences with low GC%. 
random.seed(0) # if you change this, different random sequences will be created
guides=rna.rnd_seq(10, 'GC'* 30 + 'AT' * 70, 20) 
print(guides)

['AATGAAATTA', 'TTCTCACTGA', 'AAAACTCAGT', 'AATCATAATG', 'TATTACACCA', 'ATTAGAATAA', 'TAGGTGTGTA', 'CGTATCTTAA', 'CATAGAAATT', 'CAATTTACGC', 'GAAATCGTTC', 'GCGTAAAAAA', 'TACTATAATA', 'TATTCTACAA', 'TTGAGTTCAG', 'TCTTCCCAGC', 'AAAAAGGCAG', 'TTGACACCTC', 'CCTCACACTC', 'TAACACGGTT']
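
To illustrate how the weighted alphabet string controls base composition, here is a minimal pure-Python sketch. It is not rnalib's `rnd_seq` implementation and will not reproduce the guides above; `random_kmers` and `gc_fraction` are hypothetical helper names introduced only for this example.

```python
import random

def random_kmers(k, alphabet, n, seed=None):
    """Draw n random sequences of length k.

    Letters that occur more often in `alphabet` are drawn more often.
    """
    rng = random.Random(seed)  # private generator, leaves global state untouched
    return ["".join(rng.choices(alphabet, k=k)) for _ in range(n)]

def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

# 60 G/C letters and 140 A/T letters, i.e. 30% G/C on average
alphabet = "GC" * 30 + "AT" * 70
kmers = random_kmers(10, alphabet, 20, seed=0)
mean_gc = sum(gc_fraction(s) for s in kmers) / len(kmers)
print(kmers[:3], round(mean_gc, 2))
```

With only 20 sequences of length 10, the observed mean GC fraction will scatter around the expected 0.3 rather than match it exactly.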


In [13]:
# Now we search for transcripts that contain the respective k-mers in their spliced RNA sequence. 
# To make this fast, we first search for the k-mer in the respective gene sequences (candidate_genes) and, for those genes only,
# check all spliced tx sequences (overlapping_tx).
# Finally, we want to know whether some guides bind RNAs from multiple genes, so we create a set of gene ids for the overlapping genes (overlapping_genes).
# We combine the results in a pandas dataframe.
d=[]
for guide in tqdm(guides, total=len(guides), desc='analyzing guide'):
    candidate_genes={g for g in t.genes if guide in g.sequence}
    overlapping_tx={tx for g in candidate_genes for tx in g.transcript if guide in tx.spliced_sequence}
    overlapping_genes={tx.parent.feature_id for tx in overlapping_tx}
    d.append(
        {
            'guide_seq': guide,
            'gids': ','.join(overlapping_genes),
            'n_gids': len(overlapping_genes),
            'tids': ','.join([tx.feature_id for tx in overlapping_tx])
        }
    )
df=pd.DataFrame(d)
display(df.head(8))

| | guide_seq | gids | n_gids | tids |
|---|---|---|---|---|
| 0 | AATGAAATTA | ENSG00000089048.15,ENSG00000078747.16,ENSG0000... | 7 | ENST00000202816.5,ENST00000661614.1,ENST000006... |
| 1 | TTCTCACTGA | ENSG00000101298.15,ENSG00000228340.7 | 2 | ENST00000381867.6,ENST00000381873.7,ENST000004... |
| 2 | AAAACTCAGT | ENSG00000089101.19,ENSG00000196074.13,ENSG0000... | 3 | ENST00000499879.6,ENST00000377306.5,ENST000003... |
| 3 | AATCATAATG | ENSG00000054796.13 | 1 | ENST00000494972.1,ENST00000371260.8,ENST000003... |
| 4 | TATTACACCA | ENSG00000088930.8 | 1 | ENST00000377191.5 |
| 5 | ATTAGAATAA | ENSG00000101126.18 | 1 | ENST00000371602.9 |
| 6 | TAGGTGTGTA | ENSG00000228293.1 | 1 | ENST00000418739.1 |
| 7 | CGTATCTTAA | | 0 | |
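
The two-stage search (prefilter on the gene sequence, then check the spliced sequence) can be illustrated with a toy gene. The sequences below are made up, and the sketch assumes that the gene sequence is the unspliced genomic sequence including introns. Under that assumption, a guide that spans an exon-exon junction occurs in the spliced sequence but not in the gene sequence, so the prefilter would skip it; whether this matters depends on how the guides were designed.

```python
# Toy gene: two exons separated by one intron (made-up sequences).
exon1, intron, exon2 = "ACGTACGGTA", "GTTTTTCCAG", "CCATGGATCC"
gene_seq = exon1 + intron + exon2   # unspliced (genomic) sequence
spliced_seq = exon1 + exon2         # spliced transcript sequence

def two_stage_hit(guide):
    """Prefilter on the gene sequence, then confirm on the spliced sequence."""
    return guide in gene_seq and guide in spliced_seq

def direct_hit(guide):
    """Search the spliced sequence only (no prefilter)."""
    return guide in spliced_seq

guides = {
    "exonic": "GTACGG",      # lies within exon 1
    "intronic": "TTTTTC",    # lies within the intron
    "junction": "GGTACCAT",  # spans the exon1/exon2 junction
}
results = {name: (two_stage_hit(g), direct_hit(g)) for name, g in guides.items()}
for name, (two_stage, direct) in results.items():
    print(f"{name}: two-stage={two_stage}, direct={direct}")
```

Only the junction-spanning guide gives different answers for the two strategies; searching the spliced sequences directly avoids this at the cost of more string comparisons.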


In [14]:
# In the DF above, we can see that some guides (e.g., CGTATCTTAA) are not found (no gids) while some target multiple 
# genes (e.g., AATGAAATTA). This is expected here as we generate short (10 bp) sequences with low GC%/sequence complexity
# that are likely to occur at many genomic locations. In a real scenario, one would expect few such cases for guides predicted by state-of-the-art tools.
# We continue the analysis by filtering out those bad guides...
fil=df[df['n_gids']==1].copy() 
display(fil)

| | guide_seq | gids | n_gids | tids |
|---|---|---|---|---|
| 3 | AATCATAATG | ENSG00000054796.13 | 1 | ENST00000494972.1,ENST00000371260.8,ENST000003... |
| 4 | TATTACACCA | ENSG00000088930.8 | 1 | ENST00000377191.5 |
| 5 | ATTAGAATAA | ENSG00000101126.18 | 1 | ENST00000371602.9 |
| 6 | TAGGTGTGTA | ENSG00000228293.1 | 1 | ENST00000418739.1 |
| 10 | GAAATCGTTC | ENSG00000149609.6 | 1 | ENST00000375222.4 |
| 12 | TACTATAATA | ENSG00000088812.18 | 1 | ENST00000262919.10 |
| 13 | TATTCTACAA | ENSG00000231290.6 | 1 | ENST00000448374.5,ENST00000447767.1 |
| 14 | TTGAGTTCAG | ENSG00000172264.18 | 1 | ENST00000490428.5 |
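
Before discarding guides, it can be useful to see how many fall into each category. The following sketch uses plain Python and the `n_gids` values of the first eight guides from the `df.head(8)` output; `classify` is a hypothetical helper, not part of rnalib or pandas.

```python
from collections import Counter

def classify(n_gids):
    """Map the number of targeted genes to a category label."""
    if n_gids == 0:
        return "not_found"
    if n_gids == 1:
        return "unique"
    return "multi_gene"

# n_gids values of the first eight guides, copied from the df.head(8) output
n_gids_per_guide = {
    "AATGAAATTA": 7, "TTCTCACTGA": 2, "AAAACTCAGT": 3, "AATCATAATG": 1,
    "TATTACACCA": 1, "ATTAGAATAA": 1, "TAGGTGTGTA": 1, "CGTATCTTAA": 0,
}
counts = Counter(classify(n) for n in n_gids_per_guide.values())
print(dict(counts))
```

Only the `unique` category survives the `n_gids == 1` filter used in the notebook.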


In [15]:
# ...and now we want to check for the remaining ones whether all tx of the respective genes are targeted.
# For this, we query the transcriptome for sets of transcript ids annotated for a given gene and use the 
# rnalib cmp_sets() method to get shared and unique items when comparing to the set of transcript ids we 
# found to be targeted by each guide.
# Finally, we add the number of missed (untargeted) transcripts and the respective gene names to the dataframe.
missed_tx, gene_names=[],[]
for guide, gid, tids in zip(fil['guide_seq'], fil['gids'], fil['tids']):
    all_tid={tx.feature_id for tx in t.gene[gid].transcript}
    found_tids=set(tids.split(','))
    # cmp_sets is a rnalib utility function for set comparison
    s,m,w=rna.cmp_sets(all_tid, found_tids)
    missed_tx.append(f"{len(m)}/{len(all_tid)}")
    gene_names.append(t.gene[gid].gene_name)
    assert len(w)==0, "We should not find a tx that was not found before"
fil['missed_tx']=missed_tx
fil['gene_name']=gene_names
display(fil)

| | guide_seq | gids | n_gids | tids | missed_tx | gene_name |
|---|---|---|---|---|---|---|
| 3 | AATCATAATG | ENSG00000054796.13 | 1 | ENST00000494972.1,ENST00000371260.8,ENST000003... | 1/5 | SPO11 |
| 4 | TATTACACCA | ENSG00000088930.8 | 1 | ENST00000377191.5 | 0/1 | XRN2 |
| 5 | ATTAGAATAA | ENSG00000101126.18 | 1 | ENST00000371602.9 | 8/9 | ADNP |
| 6 | TAGGTGTGTA | ENSG00000228293.1 | 1 | ENST00000418739.1 | 0/1 | ENSG00000228293 |
| 10 | GAAATCGTTC | ENSG00000149609.6 | 1 | ENST00000375222.4 | 1/2 | C20orf144 |
| 12 | TACTATAATA | ENSG00000088812.18 | 1 | ENST00000262919.10 | 1/2 | ATRN |
| 13 | TATTCTACAA | ENSG00000231290.6 | 1 | ENST00000448374.5,ENST00000447767.1 | 7/9 | APCDD1L-DT |
| 14 | TTGAGTTCAG | ENSG00000172264.18 | 1 | ENST00000490428.5 | 15/16 | MACROD2 |
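
The set comparison behind the `missed_tx` column can be sketched with built-in set operations. `compare_sets` below is a hypothetical stand-in; its return order (shared, only in first, only in second) is inferred from how `s,m,w` are used in the cell above and is not taken from rnalib's documentation. The transcript ids are placeholders.

```python
def compare_sets(a, b):
    """Return (shared, only_in_a, only_in_b) for two iterables."""
    a, b = set(a), set(b)
    return a & b, a - b, b - a

annotated = {"tx1", "tx2", "tx3"}  # placeholder ids: all transcripts of a gene
targeted = {"tx2"}                 # placeholder ids: transcripts hit by the guide

shared, missed, unexpected = compare_sets(annotated, targeted)
print(f"{len(missed)}/{len(annotated)} missed")  # 2/3 missed

# Mirrors the assertion in the notebook: every targeted transcript
# must be among the annotated transcripts of the gene.
assert not unexpected
```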


We can see that most guides miss at least one annotated transcript of their target gene; only XRN2 and ENSG00000228293 are fully covered (zero of one annotated transcript missed in each case).
In a real scenario, we would possibly restrict such a check to transcripts that are actually expressed in the respective
cells. 
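
Restricting the check to expressed transcripts could look like the following sketch. The expression values, transcript ids, and the `MIN_TPM` threshold are all made up for illustration; in practice they would come from an expression quantification of the cells of interest.

```python
# Hypothetical expression values (TPM) per transcript id
tpm = {"tx1": 12.5, "tx2": 0.0, "tx3": 3.1, "tx4": 0.2}

annotated = {"tx1", "tx2", "tx3", "tx4"}  # all transcripts of the gene
targeted = {"tx1"}                        # transcripts hit by the guide
MIN_TPM = 1.0                             # arbitrary expression cutoff

# Keep only transcripts that pass the expression cutoff
expressed = {tx for tx in annotated if tpm.get(tx, 0.0) >= MIN_TPM}

# Expressed transcripts that the guide does not target
missed_expressed = expressed - targeted

print(f"{len(missed_expressed)}/{len(expressed)} expressed transcripts missed")
```

In this toy case the guide misses three of four annotated transcripts but only one of the two expressed ones, which gives a more relevant picture of its coverage.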