A pipeline to transform raw read data in FASTQ format into clean contact data.
Requires FastUniq, Trimmomatic, Hisat2, SAMTools, BedTools.

Defining paths and variables

In [19]:
FastUniqPath=""
TrimmomaticPath=""
Hisat2Path=""
SamtoolsPath=""
BedtoolsPath=""
SRR1=""
SRR2=""
GenomePath=""
GenomeName=""
SpliceSitePath=""
ContactsOutput=""
CIGARIntersectionOutput=""
Hisat2ThreadNumber=
export PATH=$PATH:$BedtoolsPath

Checking that the required files exist

In [2]:
if [ ! -x "$FastUniqPath" ]; then echo "Wrong FastUniq path"; exit 1; fi
if [ ! -x "$TrimmomaticPath" ]; then echo "Wrong Trimmomatic path"; exit 1; fi
if [ ! -d "$Hisat2Path" ]; then echo "Wrong Hisat2 path"; exit 1; fi
if [ ! -x "$SamtoolsPath" ]; then echo "Wrong Samtools path"; exit 1; fi
if [ ! -d "$BedtoolsPath" ]; then echo "Wrong Bedtools path"; exit 1; fi
if [ ! -e "${SRR1}".fastq ] || [ ! -e "${SRR2}".fastq ]; then echo "Wrong data path"; exit 1; fi
if [ ! -e "$GenomePath" ]; then echo "Wrong genome path"; exit 1; fi

The generously provided data (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6071448) come with a batch script that processes the Illumina reads: it converts them to FASTQ format, removes linkers, transcribes the RNA reads and keeps only reads with a length in the interval [18, 23] base pairs. The script is available at https://github.com/ezioljj/scripts_for_GRID-seq. However, the data are likely to contain PCR duplicates, so in terms of our pipeline diagram (https://rnachrom2.bioinf.fbb.msu.ru/protocol) they are treated as raw reads without linkers.

Now preparing a temporary file for FastUniq

In [34]:
printf '%s\n' "${SRR1}.fastq" "${SRR2}.fastq" > list.txt

In [35]:
head list.txt

SRR18961962.fastq
SRR18961963.fastq


Launching FastUniq to remove PCR duplicates

In [36]:
$FastUniqPath -i list.txt -o ${SRR1}_ND.fastq -p  ${SRR2}_ND.fastq 

Every FASTQ record has a header line starting with "@". A quality line may also start with "@", so some records contribute two such lines; we therefore take the number of lines starting with "@" as an approximation of the number of reads.

In [23]:
head -n8 ${SRR2}_ND.fastq

@SRR18961961.183215630 183215630/1
CACGCACCGCACGTTCGT
+
AAFFFJJJJJJJJFFFJF
@SRR18961961.96647001 96647001/1
GTGTCCCCTAAACCAAAGT
+
AAAFFFJJFFJJJJJJJJF


In [38]:
cat ${SRR1}.fastq | grep '^@' | wc -l

349796877


In [39]:
cat ${SRR2}.fastq | grep '^@' | wc -l

349796877


In [40]:
cat ${SRR1}_ND.fastq | grep '^@' | wc -l

235786010


In [41]:
cat ${SRR2}_ND.fastq | grep '^@' | wc -l

235786010
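
Counting lines that start with "@" is only an approximation, because a quality string may itself begin with "@". As a cross-check, the sketch below counts records exactly, assuming every record occupies exactly four lines (as in the files shown above). The function name `count_fastq_records` and the toy reads are illustrative and not part of the pipeline.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream with exactly 4 lines per record."""
    n_lines = 0
    for n_lines, _ in enumerate(handle, start=1):
        pass
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Toy input: the quality line of the second record starts with '@'.
example = io.StringIO(
    "@read1\nCACGCACC\n+\nAAFFFJJJ\n"
    "@read2\nGTGTCCCC\n+\n@AFFFJJJ\n"
)
at_lines = sum(1 for line in example if line.startswith("@"))
example.seek(0)
# The '@' count overestimates the number of records here.
print(at_lines, count_fastq_records(example))
```

On real data the same function could be applied to an open file handle, e.g. `with open(path) as fh: count_fastq_records(fh)`.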


Restriction sites are restored according to the recognition site of the main restriction enzyme, which is 5'-AG'CT-3' for AluI. Reads ending in AG at the 3' end are kept and the string CT is appended, with its quality values copied from the AG bases; all other reads are discarded. However, we found that the reads carry an extra A at the 3' end, so we remove it before appending CT.

Code for the addition is presented below:

In [57]:
cat rest_site.py

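from sys import argv

def prt(m):
    # m holds the lines of one FASTQ record (newlines included):
    # header, sequence, '+' separator, quality.
    # Keep only reads whose sequence ends with 'AGA'.
    if len(m) == 0 or m[1][-4:-1] != 'AGA':
        return
    print(m[0][:-1])
    # Drop the extra 3' A and append 'CT' to restore the AG'CT site.
    print(m[1][:-2] + 'CT')
    if m[2][0] == '+':
        print('+')
        # Reuse the qualities of the AG bases for the appended CT.
        print(m[3][:-2] + m[3][-4:-2])
    else:
        print(m[2][:-2] + m[2][-4:-2])

m = []
with open(argv[1], 'r') as f:
    for line in f:
        # A line starting with '@' is treated as the start of a new record.
        if line[0] == '@':
            if len(m) > 0:
                prt(m)
            m = []
        m.append(line)
prt(m)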
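
To make the transformation concrete, here is a minimal sketch of the same rule applied to a single read, with the sequence and quality given as plain strings without newlines. The helper name `restore_alui_site` and the toy read are illustrative assumptions, not part of rest_site.py.

```python
def restore_alui_site(seq, qual):
    """Return (seq, qual) with the AluI site restored, or None if the read is dropped.

    Reads must end with 'AGA': the extra 3' A is removed and 'CT' is appended,
    reusing the quality values of the 'AG' bases for the new 'CT' bases.
    """
    if not seq.endswith("AGA"):
        return None
    new_seq = seq[:-1] + "CT"
    new_qual = qual[:-1] + qual[-3:-1]
    return new_seq, new_qual


# Toy read: letters in the quality string make the copied positions visible.
print(restore_alui_site("TTAGA", "ABCDE"))
print(restore_alui_site("TTAGC", "ABCDE"))
```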



In [60]:
python3 rest_site.py ${SRR1}_ND.fastq > ${SRR1}_ND_RS.fastq

In [20]:
cat ${SRR1}_ND_RS.fastq | grep '^@' | wc -l

65986562


Reads are then trimmed with Trimmomatic using a sliding window of 5 bases with a required average quality of 26; reads shorter than 14 base pairs after trimming are dropped

In [65]:
WindowSize=5
Quality=26
MinLen=14

java -jar $TrimmomaticPath PE -phred33 ${SRR2}_ND.fastq ${SRR1}_ND_RS.fastq  ${SRR2}_ND_trimmed.fastq \
    ${SRR2}_ND_unpaired.fastq ${SRR1}_ND_RS_trimmed.fastq ${SRR1}_ND_RS_unpaired.fastq  \
    SLIDINGWINDOW:$WindowSize:$Quality MINLEN:$MinLen


TrimmomaticPE: Started with arguments:
 -phred33 show/SRR18961959_ND.fastq show/SRR18961958_ND_RS.fastq show/SRR18961959_ND_trimmed.fastq show/SRR18961959_ND_unpaired.fastq show/SRR18961958_ND_RS_trimmed.fastq show/SRR18961958_ND_RS_unpaired.fastq SLIDINGWINDOW:5:26 MINLEN:14
Input Read Pairs: 62827643 Both Surviving: 59486181 (94.68%) Forward Only Surviving: 1236703 (1.97%) Reverse Only Surviving: 2061891 (3.28%) Dropped: 42868 (0.07%)
TrimmomaticPE: Completed successfully
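
The effect of SLIDINGWINDOW:5:26 followed by MINLEN:14 can be illustrated with a simplified sketch. This is not Trimmomatic's actual implementation: it only shows the idea of cutting the read at the first window whose average quality falls below the threshold, and the handling of reads shorter than the window (kept unchanged) is an assumption. All function names are illustrative.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep(quals, window=5, threshold=26):
    """Number of leading bases kept: cut at the start of the first
    window whose mean quality is below `threshold`."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) < threshold * window:
            return start
    return len(quals)


def trim_read(seq, qual, window=5, threshold=26, min_len=14):
    """Trim one read; return None if it ends up shorter than `min_len`."""
    keep = sliding_window_keep(phred33(qual), window, threshold)
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]


# A read with uniformly high quality survives untouched ...
print(trim_read("A" * 19, "J" * 19))
# ... while a read with a low-quality tail is cut and then dropped by MINLEN.
print(trim_read("A" * 15, "JJJJJJJJJJ#####"))
```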


Indexing the swine genome (Sus11.1) for subsequent mapping

In [None]:
${Hisat2Path}hisat2-build -p$Hisat2ThreadNumber $GenomePath $GenomeName > /dev/null

Downloading the GTF file to determine splice sites

In [37]:
curl -L https://ftp.ensembl.org/pub/release-108/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.108.gtf.gz -O

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.7M  100 17.7M    0     0   893k      0  0:00:20  0:00:20 --:--:--  899k


In [38]:
gzip -d Sus_scrofa.Sscrofa11.1.108.gtf.gz

Creating the splice site annotation

In [41]:
python3 ${Hisat2Path}hisat2_extract_splice_sites.py Sus_scrofa.Sscrofa11.1.108.gtf > "$SpliceSitePath"

Mapping to the reference genome (Sus11.1) using Hisat2 (DNA part)

In [66]:
${Hisat2Path}hisat2 -p$Hisat2ThreadNumber -x$GenomeName -k 100 \
--no-spliced-alignment --no-softclip \
-U ${SRR1}_ND_RS_trimmed.fastq | $SamtoolsPath view -bSh > ${SRR1}.bam 

59486181 reads; of these:
  59486181 (100.00%) were unpaired; of these:
    10343195 (17.39%) aligned 0 times
    37854974 (63.64%) aligned exactly 1 time
    11288012 (18.98%) aligned >1 times
82.61% overall alignment rate


Mapping to the reference genome (Sus11.1) using Hisat2 (RNA part)

In [67]:
${Hisat2Path}hisat2 -p$Hisat2ThreadNumber -x$GenomeName -k 100 --no-softclip --dta-cufflinks \
--known-splicesite-infile $SpliceSitePath \
--novel-splicesite-outfile ${SRR2}_novel_splice -U ${SRR2}_ND_trimmed.fastq | $SamtoolsPath view -bSh > ${SRR2}.bam

59486181 reads; of these:
  59486181 (100.00%) were unpaired; of these:
    41367129 (69.54%) aligned 0 times
    9764274 (16.41%) aligned exactly 1 time
    8354778 (14.04%) aligned >1 times
30.46% overall alignment rate


Getting the number of mapped reads

In [68]:
$SamtoolsPath view -c -F 260 ${SRR1}.bam

49142986


In [69]:
$SamtoolsPath view -c -F 260 ${SRR2}.bam

18119052


In [3]:
$SamtoolsPath view -Sh -F 4 ${SRR1}.bam | $SamtoolsPath view -Sbh | $BedtoolsPath/bamToBed -cigar -i > ${SRR1}_unfiltered.bed
$SamtoolsPath view -Sh -F 4 ${SRR2}.bam | $SamtoolsPath view -Sbh | $BedtoolsPath/bamToBed -cigar -i > ${SRR2}_unfiltered.bed

In [None]:
python3 cnt_intersect.py ${SRR1}_unfiltered.bed ${SRR2}_unfiltered.bed unfiltered_intersected1.bed

Applying filter to reads. Only unique mappings with at most 2 mismatches remain

In [70]:
$SamtoolsPath view -Sh -F 4 ${SRR1}.bam | grep -E 'XM:i:[0-2]\s.*NH:i:1$|^@' | \
$SamtoolsPath view -Sbh > ${SRR1}_filtered.bam

In [71]:
$SamtoolsPath view -Sh -F 4 ${SRR2}.bam | grep -E 'XM:i:[0-2]\s.*NH:i:1$|^@' | \
$SamtoolsPath view -Sbh > ${SRR2}_filtered.bam
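
The grep expression above relies on the order of the optional SAM tags (NH must be the last field on the line). As an illustration of the same criterion, the sketch below parses the tags explicitly and keeps an alignment only if XM is at most 2 and NH equals 1. The function name `passes_filter` and the toy SAM records are assumptions made for this example.

```python
def passes_filter(sam_line, max_mismatches=2):
    """True if the alignment has XM <= max_mismatches and NH == 1."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    if "XM" not in tags or "NH" not in tags:
        return False
    return int(tags["XM"]) <= max_mismatches and int(tags["NH"]) == 1


# Toy alignments (11 mandatory columns followed by optional tags).
mandatory = "r1\t0\tchr1\t100\t60\t19M\t*\t0\t0\tACGTACGTACGTACGTACG\tJJJJJJJJJJJJJJJJJJJ"
unique_hit = mandatory + "\tAS:i:0\tXM:i:1\tNH:i:1"
multi_hit = mandatory + "\tAS:i:0\tXM:i:0\tNH:i:3"
noisy_hit = mandatory + "\tAS:i:-18\tXM:i:3\tNH:i:1"

for record in (unique_hit, multi_hit, noisy_hit):
    print(passes_filter(record))
```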

Counting filtered reads

In [72]:
$SamtoolsPath view -c -F 260 ${SRR1}_filtered.bam

37854974


In [73]:
$SamtoolsPath view -c -F 260 ${SRR2}_filtered.bam

9764274


Converting to BED format

In [75]:
$SamtoolsPath view -Sh -F 4 ${SRR1}_filtered.bam | $SamtoolsPath view -Sbh | $BedtoolsPath/bamToBed -cigar -i > ${SRR1}.bed

In [76]:
$SamtoolsPath view -Sh -F 4 ${SRR2}_filtered.bam | $SamtoolsPath view -Sbh | $BedtoolsPath/bamToBed -cigar -i > ${SRR2}.bed

Intersecting BED files and forming contacts

In [82]:
cat cnt_intersect.py

import pandas as pd
from sys import argv

dna = pd.read_csv(argv[1], sep = "\t", header = None)
rna = pd.read_csv(argv[2], sep = "\t", header = None)
dna[3] = dna[3].str.split('.').str[1]
rna[3] = rna[3].str.split('.').str[1]
dna.set_index([3], inplace=True)
rna.set_index([3], inplace=True)
ids = list(set(dna.index) & set(rna.index))                  
dna = dna.loc[ids,]
rna = rna.loc[ids,]
res = pd.DataFrame(index=range(len(ids)),\
columns=['id', 'rna_chr', 'rna_bgn', 'rna_end', 'rna_strand', 'rna_cigar',\
'dna_chr', 'dna_bgn', 'dna_end', 'dna_strand', 'dna_cigar'])
for i, item in enumerate(ids):
    res.at[i,'id'] = item
    res.at[i,'rna_chr'] = rna.at[item,0]
    res.at[i,'rna_bgn'] = rna.at[item,1]
    res.at[i,'rna_end'] = rna.at[item,2]
    res.at[i,'rna_strand'] = rna.at[item,5]
    res.at[i,'rna_cigar'] = rna.at[item,6]
    res.at[i,'dna_chr'] = dna.at[item,0]
    res.at[i,'dna_bgn'] = dna.at[item,1]
    res.at[i,'dna_end'] = dna.at[item,2]
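    res.at[i,'dna_strand'] = dna.at[item,5]
    res.at[i,'dna_cigar'] = dna.at[item,6]
# Assumed ending, reconstructed from how the output is used later
# (tab-separated, with a header line, read id in the first column).
res.to_csv(argv[3], sep='\t', index=False)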
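
The pairing logic of cnt_intersect.py can also be sketched without pandas: both BED files are keyed by the numeric part of the read name, and a contact is emitted for every read id present in both. This is a simplified illustration under the assumption that each read id occurs at most once per file (true after the uniqueness filter); the function names and the toy BED lines are hypothetical.

```python
import csv
import io


def read_bed(handle):
    """Map read id -> (chrom, start, end, strand, cigar) from bamToBed -cigar output."""
    records = {}
    for row in csv.reader(handle, delimiter="\t"):
        read_id = row[3].split(".")[1]  # 'SRR123.7' -> '7'
        records[read_id] = (row[0], row[1], row[2], row[5], row[6])
    return records


def pair_contacts(dna, rna):
    """Yield (id, RNA part, DNA part) for read ids found in both dictionaries."""
    for read_id in sorted(dna.keys() & rna.keys()):
        yield (read_id, *rna[read_id], *dna[read_id])


# Toy BED content: only read 7 is present in both files.
dna_bed = io.StringIO("chr2\t100\t119\tSRR1.7\t60\t-\t19M\n"
                      "chr3\t500\t519\tSRR1.8\t60\t+\t19M\n")
rna_bed = io.StringIO("chr1\t10\t29\tSRR2.7\t60\t+\t19M\n")

contacts = list(pair_contacts(read_bed(dna_bed), read_bed(rna_bed)))
print(contacts)
```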

In [77]:
python3 cnt_intersect.py ${SRR1}.bed ${SRR2}.bed $ContactsOutput

Counting contacts

In [78]:
wc -l $ContactsOutput

2768273 2rep_cnt.tsv


CIGAR filtering: reads with a complete match ('digitsM' pattern) are left intact, reads with a single gap ('digitsMdigitsNdigitsM' pattern) are cut down to their longer matching block, and all other reads are dropped

In [16]:
cat cigar_filter.py

import re
from sys import argv

c1 = re.compile(r'\d+M')
c2 = re.compile(r'(\d+)M(\d+)N(\d+)M')
cnt1 = 0  # reads with a single match block
cnt2 = 0  # reads with two match blocks around one gap

with open(argv[1], 'r') as f, open(argv[2], 'w') as f1:
    for line in f:
        llist = line.split()
        tline = llist[6]  # CIGAR column written by bamToBed -cigar
        n_match = len(c1.findall(tline))

        if n_match == 1:
            cnt1 += 1
            print(line, end='', file=f1)
        elif n_match == 2:
            l2 = c2.findall(tline)
            if not l2:
                # two match blocks that are not separated by a single N gap
                continue
            cnt2 += 1
            left, right = int(l2[0][0]), int(l2[0][2])
            if left > right:
                # keep the left block: move the end coordinate
                llist[2] = str(int(llist[1]) + left)
            else:
                # keep the right block: move the start coordinate
                llist[1] = str(int(llist[2]) - right)
            # llist carries no trailing newline, so print() must add one
            print('\t'.join(llist), file=f1)

print(f'full match {cnt1}')
print(f'central gap {cnt2}')
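
The coordinate arithmetic of the CIGAR filter is easier to follow on a single interval. The sketch below applies the same rule to one (start, end, CIGAR) triple; unlike the script, it requires the whole CIGAR string to match one of the two patterns, which is an assumption made here to keep the example strict. The function name `trim_interval` and the coordinates are illustrative.

```python
import re

FULL = re.compile(r"(\d+)M")
GAPPED = re.compile(r"(\d+)M(\d+)N(\d+)M")


def trim_interval(start, end, cigar):
    """Return the (start, end) to keep, or None if the read is dropped."""
    if FULL.fullmatch(cigar):
        return start, end
    match = GAPPED.fullmatch(cigar)
    if match is None:
        return None
    left, _gap, right = map(int, match.groups())
    if left > right:
        # The left block is longer: keep [start, start + left).
        return start, start + left
    # Otherwise keep the right block: [end - right, end).
    return end - right, end


print(trim_interval(1000, 1019, "19M"))
print(trim_interval(1000, 1519, "12M500N7M"))
print(trim_interval(1000, 1519, "7M500N12M"))
print(trim_interval(1000, 1019, "10M1I8M"))
```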



In [None]:
python3 cigar_filter.py ${SRR2}.bed ${SRR2}_filtered.bed

Intersecting contacts

In [21]:
python3 cnt_intersect.py ${SRR1}.bed ${SRR2}_filtered.bed $CIGARIntersectionOutput

Counting contacts

In [23]:
wc -l $CIGARIntersectionOutput

2037345 4rep_cigar_out.tsv


Merging contacts from all replicates together

In [24]:
cat *cigar*tsv| grep -v rna_bgn > merged_before_rna.tsv

In [109]:
wc -l merged_before_rna.tsv

8955558 merged_before_rna.tsv


A script to transform the GTF file into a gene annotation table

In [26]:
cat genes_tidy.py

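from sys import argv

with open(argv[1], 'r') as f, open(argv[2], 'w') as f1:
    for line in f:
        llist = line.split()
        try:
            if llist[2] == 'gene':
                # For gene records whose attributes are ordered as gene_id,
                # gene_version, gene_name, gene_source, gene_biotype,
                # llist[13] is the gene name and llist[17] the biotype;
                # [1:-2] strips the leading quote and the trailing '";'.
                print(f'chr{llist[0]}\t{llist[3]}\t{llist[4]}\t{llist[6]}\t{llist[17][1:-2]}\t{llist[13][1:-2]}\t{int(llist[4])-int(llist[3])}', file=f1)
        except (IndexError, ValueError):
            # Header lines and gene records with fewer attributes
            # (for example, without gene_name) are skipped.
            continue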
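
The script above reads the gene name and biotype by their position in the line. As an alternative illustration, the attribute column can be parsed by key, which does not depend on attribute order. This is a sketch, not the method used for the results below: unlike genes_tidy.py, it does not skip genes without a gene_name but returns None in that field. The function name `gene_row` and the toy GTF line are made up for the example.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def gene_row(gtf_line):
    """Return (chrom, start, end, strand, biotype, name, length) for gene records."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    attributes = dict(ATTRIBUTE.findall(fields[8]))
    start, end = int(fields[3]), int(fields[4])
    return (f"chr{fields[0]}", start, end, fields[6],
            attributes.get("gene_biotype"), attributes.get("gene_name"),
            end - start)


# Toy record with invented identifiers and coordinates.
line = ('1\tensembl\tgene\t100\t600\t.\t-\t.\t'
        'gene_id "GENE0001"; gene_version "1"; gene_name "TOY1"; '
        'gene_source "ensembl"; gene_biotype "protein_coding";')
print(gene_row(line))
```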


Transforming:

In [27]:
python3 genes_tidy.py Sus_scrofa.Sscrofa11.1.108.gtf genes.tsv

In [29]:
wc -l genes.tsv

17508 genes.tsv


RNA type distribution among the genes:

In [8]:
cat genes.tsv | awk '{print $5}' | sort | uniq -c

    367 miRNA
     17 misc_RNA
  15681 protein_coding
      7 ribozyme
     18 rRNA
     11 scaRNA
    318 snoRNA
   1074 snRNA
      3 TR_V_gene
      3 vault_RNA
      9 Y_RNA


A script to annotate the RNA parts of contacts

In [13]:
cat annotation.sh

# Creating temporary files
t1=$(mktemp)
t2=$(mktemp)

# Contacts and genes paths
ContactsFile='merged_before_rna.tsv'
GenesFile='genes.tsv'

# Intersection of contacts and genes, strands are stored in corresponding files; bedtools was not found on HPC-2, so these files are provided
bedtools intersect -a <(cat $ContactsFile| awk '{print $2, $3, $4}'  OFS='\t') -b <(cat $GenesFile) -wb > intersected_genes.tsv
bedtools intersect -a <(cat $ContactsFile| awk '{print $2, $3, $4, $5, $1}'  OFS='\t') -b <(cat $GenesFile) -wa > intersected_contacts.tsv

# Counting how many contacts intersected certain gene = gene score
cat intersected_genes.tsv | awk '{print $9}' | sort | uniq -c > gene_score.csv

# Adding score column to the intersection from the genes perspective
join -1 9 -2 2 -o1.1,1.2,1.3,1.7,1.8,1.9,1.10,2.1 <(sort -k 9,9 intersected_genes.tsv) gene_score.csv > intersected_scored_genes.csv

# Dividing by gene length
cat intersected_scored_genes.csv | awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$8/$7}' >

In [14]:
./annotation.sh
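
The gene score used in annotation.sh is the number of contacts whose RNA part overlaps a gene, and the last step divides it by the gene length. A minimal sketch of that calculation on toy data follows; the function name `length_normalised_scores` and the input values are illustrative assumptions.

```python
from collections import Counter


def length_normalised_scores(hits, gene_lengths):
    """Map gene -> (score, score / length).

    `hits` lists one gene name per contact/gene overlap;
    `gene_lengths` maps gene name -> gene length in base pairs.
    """
    counts = Counter(hits)
    return {gene: (n, n / gene_lengths[gene]) for gene, n in counts.items()}


# Toy data: two overlaps with one gene, one overlap with another.
scores = length_normalised_scores(["geneA", "geneA", "geneB"],
                                  {"geneA": 100, "geneB": 50})
print(scores)
```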

Counting contacts

In [8]:
wc -l annotated_contacts.csv

5278072 annotated_contacts.csv


RNA biotype distribution after annotation

In [9]:
cat annotated_contacts.csv | sort -k12,12 |  awk '{print $11,$12}' | uniq -f1 | awk '{print $1}' | sort | uniq -c

    116 miRNA
      1 misc_RNA
  14837 protein_coding
      2 ribozyme
      1 rRNA
     10 scaRNA
    166 snoRNA
      8 snRNA
      1 vault_RNA
      1 Y_RNA


Contacts biotype distribution after annotation

In [10]:
cat annotated_contacts.csv  | awk '{print $11}' | sort | uniq -c

   2138 miRNA
     13 misc_RNA
5222517 protein_coding
   4487 ribozyme
      6 rRNA
   2452 scaRNA
  31919 snoRNA
  12764 snRNA
    131 vault_RNA
   1645 Y_RNA


Getting RNAs with the most contacts

In [11]:
cat annotated_contacts.csv | sort -k12,12 |  awk '{print $11,$12}' | uniq -c  > rna_cnt_number1.csv

In [12]:
cat rna_cnt_number1.csv | sort -k1nr,1n | head -n10 

 102029 protein_coding TSEN2
  41933 protein_coding NEB
  37658 protein_coding MBNL1
  35166 protein_coding ZBTB16
  24259 protein_coding LRRC43
  20425 protein_coding MYO18B
  20265 protein_coding AUTS2
  19497 protein_coding CACNA2D1
  17672 protein_coding TRDN
  17436 protein_coding TEX14
sort: write failed: 'standard output': Broken pipe
sort: write error


Attempting to prepare data for normalization

In [282]:
grep protein_coding contacts_rna_annotated1.csv > pcrna.csv

In [288]:
grep -v protein_coding contacts_rna_annotated1.csv > npcrna.csv

In [290]:
head npcrna.csv

chr6 71971138 71971157 + 19M chr13 124590001 124590020 - 19M snRNA U5 19686 100002016
chr14 40286509 40286528 - 19M chr6 100028477 100028496 + 19M snRNA U4 8808 100004528
chr1 236380016 236380035 - 19M chr18 11678219 11678239 + 20M ribozyme RNase_MRP 1300 100031442
chr14 40286508 40286527 - 19M chr11 13656391 13656410 - 19M snRNA U4 8808 100036647
chr1 163335027 163335046 + 19M chr14 48515712 48515731 - 19M snRNA U5 19686 100043587
chr14 40286550 40286569 - 19M chr13 199193288 199193307 - 19M snRNA U4 8808 100059528
chr14 40286550 40286569 - 19M chr6 54532506 54532526 + 20M snRNA U4 8808 100070833
chr14 40286550 40286569 - 19M chr6 54532506 54532526 + 20M snRNA U4 8808 100070833
chr6 54567220 54567239 + 19M chr18 16638200 16638219 + 19M snoRNA SNORD33 1228 100084671
chr6 71971138 71971157 + 19M chr1 118912052 118912071 + 19M snRNA U5 19686 100087314


In [310]:
cat pcrna.csv | awk '{print $12}' | sort | uniq -c | sort -k1nr,1 | tail -n+51 | head -n -1000 > fpcrna

In [311]:
wc -l fpcrna

13911 fpcrna
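
The command above ranks protein-coding genes by their number of contacts and discards the 50 genes with the most contacts and the 1000 genes with the fewest. A small sketch of the same selection is given below; the function name `trim_ranked_genes` is illustrative, and the order of genes with equal counts may differ from the output of sort.

```python
from collections import Counter


def trim_ranked_genes(gene_names, drop_top=50, drop_bottom=1000):
    """Rank genes by contact count (descending) and drop both extremes."""
    ranked = [gene for gene, _ in Counter(gene_names).most_common()]
    stop = max(len(ranked) - drop_bottom, drop_top)
    return ranked[drop_top:stop]


# Toy data: one gene name per contact, with small cut-offs for readability.
names = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
print(trim_ranked_genes(names, drop_top=1, drop_bottom=2))
```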


In [314]:
join -1 12 -2 2 <(cat contacts_rna_annotated1.csv | sort -k12,12) <(cat fpcrna | sort -k2,2) > fpcrna_contacts.csv

In [315]:
head fpcrna_contacts.csv

A1CF chr14 98909017 98909036 + 19M chr14 98908799 98908818 - 19M protein_coding 54 245788663 54
A1CF chr14 98910836 98910855 - 19M chr14 98910543 98910562 - 19M protein_coding 54 146253771 54
A1CF chr14 98911368 98911387 + 19M chr16 73406426 73406445 + 19M protein_coding 54 232813212 54
A1CF chr14 98911368 98911387 + 19M chr16 73406426 73406445 + 19M protein_coding 54 232813212 54
A1CF chr14 98913970 98913990 + 20M chr14 98914095 98914115 + 20M protein_coding 54 145406469 54
A1CF chr14 98913970 98913990 + 20M chr14 98914095 98914115 + 20M protein_coding 54 145406469 54
A1CF chr14 98919267 98919286 - 19M chr14 49314303 49314323 + 20M protein_coding 54 95062353 54
A1CF chr14 98919352 98919371 - 19M chr14 99673719 99673738 + 19M protein_coding 54 161146852 54
A1CF chr14 98920675 98920694 - 19M chr14 98920575 98920595 - 20M protein_coding 54 144115271 54
A1CF chr14 98929193 98929212 + 19M chr14 98929419 98929439 + 20M protein_coding 54 8849515 54


In [321]:
cat fpcrna_contacts.csv | awk '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$1,$13,$14,$15}' OFS='\t' > fpcrna_contacts1.csv

In [322]:
head fpcrna_contacts1.csv

chr14	98909017	98909036	+	19M	chr14	98908799	98908818	-	19M	protein_coding	A1CF	54	245788663	54
chr14	98910836	98910855	-	19M	chr14	98910543	98910562	-	19M	protein_coding	A1CF	54	146253771	54
chr14	98911368	98911387	+	19M	chr16	73406426	73406445	+	19M	protein_coding	A1CF	54	232813212	54
chr14	98911368	98911387	+	19M	chr16	73406426	73406445	+	19M	protein_coding	A1CF	54	232813212	54
chr14	98913970	98913990	+	20M	chr14	98914095	98914115	+	20M	protein_coding	A1CF	54	145406469	54
chr14	98913970	98913990	+	20M	chr14	98914095	98914115	+	20M	protein_coding	A1CF	54	145406469	54
chr14	98919267	98919286	-	19M	chr14	49314303	49314323	+	20M	protein_coding	A1CF	54	95062353	54
chr14	98919352	98919371	-	19M	chr14	99673719	99673738	+	19M	protein_coding	A1CF	54	161146852	54
chr14	98920675	98920694	-	19M	chr14	98920575	98920595	-	20M	protein_coding	A1CF	54	144115271	54
chr14	98929193	98929212	+	19M	chr14	98929419	98929439	+	20M	protein_coding	A1CF	54	8849515	54


In [318]:
head npcrna.csv

chr6 71971138 71971157 + 19M chr13 124590001 124590020 - 19M snRNA U5 19686 100002016
chr14 40286509 40286528 - 19M chr6 100028477 100028496 + 19M snRNA U4 8808 100004528
chr1 236380016 236380035 - 19M chr18 11678219 11678239 + 20M ribozyme RNase_MRP 1300 100031442
chr14 40286508 40286527 - 19M chr11 13656391 13656410 - 19M snRNA U4 8808 100036647
chr1 163335027 163335046 + 19M chr14 48515712 48515731 - 19M snRNA U5 19686 100043587
chr14 40286550 40286569 - 19M chr13 199193288 199193307 - 19M snRNA U4 8808 100059528
chr14 40286550 40286569 - 19M chr6 54532506 54532526 + 20M snRNA U4 8808 100070833
chr14 40286550 40286569 - 19M chr6 54532506 54532526 + 20M snRNA U4 8808 100070833
chr6 54567220 54567239 + 19M chr18 16638200 16638219 + 19M snoRNA SNORD33 1228 100084671
chr6 71971138 71971157 + 19M chr1 118912052 118912071 + 19M snRNA U5 19686 100087314


In [319]:
cat npcrna.csv | awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14}' OFS='\t' > npcrna1.tsv

In [320]:
head npcrna1.tsv

chr6	71971138	71971157	+	19M	chr13	124590001	124590020	-	19M	snRNA	U5	19686	100002016
chr14	40286509	40286528	-	19M	chr6	100028477	100028496	+	19M	snRNA	U4	8808	100004528
chr1	236380016	236380035	-	19M	chr18	11678219	11678239	+	20M	ribozyme	RNase_MRP	1300	100031442
chr14	40286508	40286527	-	19M	chr11	13656391	13656410	-	19M	snRNA	U4	8808	100036647
chr1	163335027	163335046	+	19M	chr14	48515712	48515731	-	19M	snRNA	U5	19686	100043587
chr14	40286550	40286569	-	19M	chr13	199193288	199193307	-	19M	snRNA	U4	8808	100059528
chr14	40286550	40286569	-	19M	chr6	54532506	54532526	+	20M	snRNA	U4	8808	100070833
chr14	40286550	40286569	-	19M	chr6	54532506	54532526	+	20M	snRNA	U4	8808	100070833
chr6	54567220	54567239	+	19M	chr18	16638200	16638219	+	19M	snoRNA	SNORD33	1228	100084671
chr6	71971138	71971157	+	19M	chr1	118912052	118912071	+	19M	snRNA	U5	19686	100087314


In [323]:
cat fpcrna_contacts1.csv npcrna1.tsv > filtered_mrna_contacts.tsv