# Preparing Intron Map
For this preparatory step, we extract intron information from GFF3 annotations and convert the intron loci into BED format for easy overlap detection.
<br>
For our study of C. elegans, we used the WS253 genome build for all analyses. The same steps can be adapted to run this pipeline on other species.



In [1]:
import pandas as pd
import numpy as np

## Load Genome Annotations
Note that GFF3 is 1-based and end-inclusive, while BED is 0-based and end-exclusive. A GFF3 start:end of 1:100 is therefore equivalent to a BED start:end of 0:100.
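The coordinate rule above can be checked with a small worked example. This is only a sketch: the helper names `gff3_to_bed_interval` and `bed_to_gff3_interval` are hypothetical and are not used elsewhere in this notebook.

```python
def gff3_to_bed_interval(start, end):
    """Convert a 1-based, end-inclusive GFF3 interval to a 0-based, end-exclusive BED interval."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF3 interval: {start}:{end}")
    # Only the start shifts; the inclusive end equals the exclusive end numerically.
    return start - 1, end


def bed_to_gff3_interval(start, end):
    """Inverse conversion, from BED back to GFF3 coordinates."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}:{end}")
    return start + 1, end


gff3_start, gff3_end = 1, 100
bed_start, bed_end = gff3_to_bed_interval(gff3_start, gff3_end)
print(bed_start, bed_end)  # 0 100

# The feature length is the same under both conventions.
assert gff3_end - gff3_start + 1 == bed_end - bed_start
```

Because only the start changes, a BED interval's length is simply `end - start`, which is what makes overlap arithmetic convenient.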

In [2]:
ws253_annotations = pd.read_csv("genomes/ws253/ws253.annotations.gff3", skiprows=8, sep="\t", header=None)
ws253_annotations.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,I,BLAT_EST_OTHER,expressed_sequence_match,1,50,12.8,-,.,ID=yk585b5.5.6;Target=yk585b5.5 119 168 +
1,I,BLAT_Trinity_OTHER,expressed_sequence_match,1,52,20.4,+,.,ID=elegans_PE_SS_GG6116|c0_g1_i1.2;Target=eleg...
2,I,inverted,inverted_repeat,1,212,66.0,.,.,Note=loop 426


Here we additionally require that each intron has a known parent transcript.
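To illustrate what this requirement means on a row level, the sketch below applies a stricter variant of the filter to a toy table: it matches the feature type exactly and pulls the transcript ID out of the attribute column with a regular expression. The toy rows, the `toy` frame, and the `transcript` column are assumptions made for illustration; the notebook cell itself uses substring matching.

```python
import pandas as pd

# Toy rows mimicking GFF3 columns 0-8; attribute values are illustrative only.
toy = pd.DataFrame([
    ["I", "WormBase", "intron", 4359, 5194, ".", "-", ".", "Parent=Transcript:Y74C9A.3;Note=Confirmed_EST"],
    ["I", "WormBase", "intron", 100, 200, ".", "+", ".", "Note=no_parent"],
    ["I", "WormBase", "exon", 300, 400, ".", "+", ".", "Parent=Transcript:Y74C9A.3"],
])

# Exact match on the feature-type column.
is_intron = toy[2] == "intron"

# Capture the transcript ID; rows without a Parent=Transcript attribute become NaN.
parent = toy[8].str.extract(r"Parent=Transcript:([^;,]+)", expand=False)

toy_introns = toy[is_intron & parent.notna()].assign(transcript=parent)
print(toy_introns[[0, 3, 4, 6, "transcript"]])
```

Only the first toy row survives: the second is an intron without a parent transcript, and the third has a parent but is not an intron.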

In [3]:
intron_annotations = ws253_annotations[(ws253_annotations[2].str.contains("intron")) & (ws253_annotations[8].str.contains("Transcript"))]
print(intron_annotations.shape)
intron_annotations.head(3)

(196886, 9)


Unnamed: 0,0,1,2,3,4,5,6,7,8
849,I,WormBase,intron,4359,5194,.,-,.,Parent=Transcript:Y74C9A.3;Note=Confirmed_EST ...
1033,I,WormBase,intron,5297,6036,.,-,.,Parent=Transcript:Y74C9A.3;Note=Confirmed_EST ...
1274,I,WormBase,intron,6328,9726,.,-,.,Parent=Transcript:Y74C9A.3;Note=Confirmed_EST ...


In [4]:
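# Write tab-separated rows without the pandas index or header so the file stays GFF3-like.
intron_annotations.to_csv("genomes/c_elegans.WS253.introns.gff3", sep="\t", header=False, index=False)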

In [5]:
def convert_bed(row):
    """Convert one GFF3 row to a tab-separated BED6 line.

    GFF3 is 1-based and end-inclusive, while BED is 0-based and end-exclusive,
    so a GFF3 start:end of 1:100 is equivalent to a BED start:end of 0:100.
    We therefore subtract 1 from the start position and leave the end unchanged.
    """
    return f"{row[0]}\t{row[3]-1}\t{row[4]}\t{row[8]}\t.\t{row[6]}"

In [6]:
intron_bed = intron_annotations.apply(convert_bed, axis=1)

In [7]:
with open("genomes/ws253/ws253.intron.bed", "w") as out:
    for bed in intron_bed:
        bed = bed.strip('"')  # Precautionary: an earlier run produced stray quotation marks
        out.write(bed + "\n")

## Filter Introns
Here we filter out all introns that overlap known small RNAs (smRNAs), repeat regions, or transposable elements.
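As a sketch of how such features can be selected, the example below matches a list of feature types against the feature-type column (column 2) only, using `Series.isin`. This is a variant shown on toy data, not the notebook's own cell, which tests every column; the names `toy_annotations` and `toy_filter_types` and the toy tRNA row are hypothetical.

```python
import pandas as pd

# Toy rows mimicking GFF3 columns 0-8; values are illustrative only.
toy_annotations = pd.DataFrame([
    ["I", "WormBase", "intron", 4359, 5194, ".", "-", ".", "Parent=Transcript:Y74C9A.3"],
    ["I", "inverted", "inverted_repeat", 1, 212, 66.0, ".", ".", "Note=loop 426"],
    ["I", "WormBase", "tRNA", 500, 572, ".", "+", ".", "ID=toy_tRNA"],
])

# A shortened, hypothetical list of feature types to mask.
toy_filter_types = ["inverted_repeat", "tRNA"]

# Keep rows whose feature type (column 2) is in the list.
toy_filter_annot = toy_annotations[toy_annotations[2].isin(toy_filter_types)]
print(toy_filter_annot[2].tolist())
```

Restricting the match to one column avoids accidental hits in the source or attribute columns and is cheaper on a genome-scale table, but it can select a different set of rows than an all-column match if a type name also appears elsewhere.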

In [8]:
filter_req = ["inverted_repeat","repeat_region", "snoRNA", "transposable_element_insertion_site",
            "piRNA", "transposable_element", "pseudogenic_tRNA", "pre_miRNA","snRNA", 
            "miRNA_primary_transcript", "pseudogenic_rRNA", "miRNA", "rRNA", "tRNA"]

In [9]:
filter_annot = ws253_annotations[np.isin(ws253_annotations, filter_req).any(axis=1)]

In [10]:
filter_bed = filter_annot.apply(convert_bed, axis=1)
with open("genomes/ws253/ws253.filter.bed", "w") as out:
    for bed in filter_bed:
        bed = bed.strip('"')  # Precautionary: an earlier run produced stray quotation marks
        out.write(bed + "\n")

For this step we use the `-v` option of `bedtools intersect` to keep only the loci in our intron BED file that have no overlap with our filter BED file. We do not use the strand-specific option (`-s`), so an intron is omitted if it overlaps a filter feature on either strand.
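The behaviour described above can be mimicked in a few lines of plain Python. This is a simplified sketch on toy intervals, not how bedtools is implemented; the function names and the toy records are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open (BED-style) intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def keep_non_overlapping(a_records, b_records):
    """Keep records from A with no overlap in B on the same chromosome, ignoring strand."""
    kept = []
    for chrom, start, end, strand in a_records:
        hit = any(
            chrom == b_chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end, _b_strand in b_records
        )
        if not hit:
            kept.append((chrom, start, end, strand))
    return kept


toy_introns = [("I", 100, 200, "+"), ("I", 300, 400, "-"), ("II", 100, 200, "+")]
toy_filter = [("I", 150, 160, "-"), ("I", 400, 450, "+")]

kept = keep_non_overlapping(toy_introns, toy_filter)
print(kept)
```

The first toy intron is dropped even though the overlapping filter feature lies on the opposite strand. The second is kept because the intervals are half-open: an intron ending at 400 and a feature starting at 400 are adjacent and do not overlap.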

In [11]:
%%bash
bedtools intersect -v -a genomes/ws253/ws253.intron.bed -b genomes/ws253/ws253.filter.bed > genomes/ws253/ws253.intron.filter.bed

# Done
Now that we have prepared our filtered intron map, we can proceed with our anti-sense intron pipeline.