**Iris Lee and Danny Collinson**

**Assignment 2**

We are planning on working with miRTarBase, a database of interactions between microRNA (miRNA) and their targets.

https://academic.oup.com/nar/article/46/D1/D296/4595852?login=true
https://mirtarbase.cuhk.edu.cn/~miRTarBase/miRTarBase_2022/php/index.php

To do so, we are going to use the version of the data provided by Therapeutics Data Commons.

https://tdcommons.ai/multi_pred_tasks/mti/#mirtarbase

The TDC Python package (PyTDC) downloads the data locally (338 MB) and loads it into three pandas dataframes for training, validation, and testing.

In [1]:
# Run once
!pip install PyTDC



In [2]:
import numpy as np
import pandas as pd
from tdc.multi_pred import MTI

In [3]:
data = MTI(name = 'miRTarBase')
split = data.get_split()

Found local copy...
Loading...
Done!


split is a dictionary containing the three dataframes.

In [4]:
split.keys()

dict_keys(['train', 'valid', 'test'])

Each value in split is a dataframe whose rows pair a miRNA (ID and sequence) with a target protein (ID and amino acid sequence). All of these are positive pairs (Y = 1).

In [5]:
split['train']

Unnamed: 0,miRNA_ID,miRNA,Target_ID,Target,Y
0,ath-miR398c-3p,UGUGUUCUCAGGUCACCCCUG,817365,MAATNTILAFSSPSRLLIPPSSNPSTLRSSFRGVSLNNNNLHRLQS...,1
1,ath-miR398b-3p,UGUGUUCUCAGGUCACCCCUG,817365,MAATNTILAFSSPSRLLIPPSSNPSTLRSSFRGVSLNNNNLHRLQS...,1
2,ath-miR398b-3p,UGUGUUCUCAGGUCACCCCUG,837405,MAKGVAVLNSSEGVTGTIFFTQEGDGVTTVSGTVSGLKPGLHGFHV...,1
3,ath-miR398a-3p,UGUGUUCUCAGGUCACCCCUU,817365,MAATNTILAFSSPSRLLIPPSSNPSTLRSSFRGVSLNNNNLHRLQS...,1
4,ath-miR398a-3p,UGUGUUCUCAGGUCACCCCUU,837405,MAKGVAVLNSSEGVTGTIFFTQEGDGVTTVSGTVSGLKPGLHGFHV...,1
...,...,...,...,...,...
280053,mmu-miR-146a-5p,UGAGAACUGAAUUCCAUGGGUU,1489832,MDNLTKVREYLKSYSRLDQAVGEIDEIEAQRAEKSNYELFQEDGVE...,1
280054,mmu-miR-93-5p,CAAAGUGCUGUUCGUGCAGGUAG,1489832,MDNLTKVREYLKSYSRLDQAVGEIDEIEAQRAEKSNYELFQEDGVE...,1
280055,mmu-miR-378a-5p,CUCCUGACUCCAGGUCCUGUGU,1489832,MDNLTKVREYLKSYSRLDQAVGEIDEIEAQRAEKSNYELFQEDGVE...,1
280056,mmu-miR-24-3p,UGGCUCAGUUCAGCAGGAACAG,1489835,MEVHDFETDEFNDFNEDDYATREFLNPDERMTYLNHADYNLNSPLI...,1


In [6]:
split['valid']

Unnamed: 0,miRNA_ID,miRNA,Target_ID,Target,Y
0,hsa-miR-4728-5p,UGGGAGGGGAGAGGCAGCAAGCA,246175,MRLIGMPKEKYDPPDPRRIYTIMSAEEVANGKKSHWAELEISGRVR...,1
1,hsa-let-7g-3p,CUGUACAGGCCACUGCCUUGC,55147,MASDDFDIVIEAMLEAPYKKEEDEQQRKEVKKDYPSNTTSSTSNSG...,1
2,hsa-miR-450b-5p,UUUUGCAAUAUGUUCCUGAAUA,63926,MALADKRLENLQIYKVLQCVRNKDKKQIEKLTKLGYPELINYTEPI...,1
3,hsa-miR-7158-3p,CUGAACUAGAGAUUGGGCCCA,157777,MSNLKMKEAALIYLDRSGGLQKFIDDCKYYNDSKQSYAVYRFKILI...,1
4,hsa-miR-4319,UCCCUGAGCAAAGCCAC,55180,MKVFCEVLEELYKKVLLGATLENDSHDYIFYLNPAVSDQDCSTATS...,1
...,...,...,...,...,...
40003,hsa-miR-6894-5p,AGGAGGAUGGAGAGCUGGGCCAGA,5982,MEVEAVCGGAGEVEAQDSDPAPAFSKAPGSAGHYELPWVEKYRPVK...,1
40004,hsa-miR-106a-5p,AAAAGUGCUUACAGUGCAGGUAG,282996,MVLAAAMSQDADPSGPEQPDRVACSVPGARASPAPSGPRGMQQPPP...,1
40005,hsa-miR-8054,GAAAGUACAGAUCGGAUGGGU,221937,MAEVGEDSGARALLALRSAPCSPVLCAAAAAAAFPAAAPPPAPAQP...,1
40006,hsa-miR-3124-3p,ACUUUCCUCACUCCCGUGAAGU,375035,MDKLKKVLSGQDTEDRSGLSEVVEASSLSWSTRIKGFIACFAIGIL...,1


According to the documentation, negative samples are generated by calling

`data = data.neg_sample(frac = 1)`

However, the installed TDC package code is older and calls `df.append`, which was removed in pandas 2.0. To fix this, first locate the source file for `NegSample`, the function that raises the append error, by running

`from tdc import utils`

`utils.NegSample??`

Then search that file for the two places that use `df.append`, and change each one to a block of the following form:

```
df = pd.concat(
    [
        df,
        pd.DataFrame(neg_list_val).rename(
            columns={0: id1, 1: x1, 2: id2, 3: x2, 4: "Y"}
        ),
    ]
).reset_index(drop=True)
return df
```
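For reference, here is the same `append`-to-`concat` change in isolation, as a minimal sketch on toy data. The dataframes and values below are made up for illustration and are not taken from the TDC source; only `pd.concat`, `rename`, and `reset_index` are real pandas calls.

```python
import pandas as pd

# Toy stand-in for the existing positive pairs (values are invented).
positives = pd.DataFrame(
    {"miRNA_ID": ["mir-a", "mir-b"], "Target_ID": [1, 2], "Y": [1, 1]}
)

# Sampled negatives arrive as a list of rows, so the columns are
# numbered 0, 1, 2 until they are renamed.
neg_rows = [["mir-a", 2, 0], ["mir-b", 1, 0]]
negatives = pd.DataFrame(neg_rows).rename(
    columns={0: "miRNA_ID", 1: "Target_ID", 2: "Y"}
)

# pandas < 2.0 allowed: positives.append(negatives).reset_index(drop=True)
# pandas >= 2.0 needs pd.concat on a list of dataframes instead.
combined = pd.concat([positives, negatives]).reset_index(drop=True)
print(combined)
```

`reset_index(drop=True)` matters here: without it, the concatenated frame would keep the duplicate index values 0 and 1 from both inputs.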
After patching the TDC source, restart this notebook's kernel so that the edited module is reloaded. Then generate the negative samples:

In [7]:
neg_data = data.neg_sample()

In [8]:
neg_data_split = neg_data.get_split()

In [9]:
neg_data_split['test']

Unnamed: 0,miRNA_ID,miRNA,Target_ID,Target,Y
0,hsa-miR-3609,CAAAGUGAUGAGUAAUACUGGCUG,8614,MCAERLGQFMTLALVLATFDPARGTDATNPPEGPQDRSSQQKGRLS...,1
1,hsa-miR-1269a,CUGGACUGAGCCGUGCUACUGG,30001,MGRGWGFLFGLLGAVWLLSSGHGEEQPPETAAQRCFCQVSGYLDDC...,0
2,hsa-miR-7162-3p,UCUGAGGUGGAACAGCAGC,14634,MEAQAHSSTATERKKAENSIGKCPTRTDVSEKAVASSTTSNEDESP...,0
3,hsa-miR-210-3p,CUGUGCGUGUGACAGCGGCUGA,12223,MDPTAPGSSVSSLPLLLVLALGLAILHCVVADGNTTRTPETNGSLC...,0
4,hsa-miR-1180-5p,GGACCCACCCGGCCGGGAAUA,4286,MQSESGIVPDFEVGEEFHEEPKTYYELKSQPLKSSSSAEHPGASKP...,0
...,...,...,...,...,...
160028,hsa-miR-4279,CUCUCCUCCCGGCUUC,56142,MVFTPEDRLGKQCLLLPLLLLAAWKVGSGQLHYSVPEEAKHGTFVG...,1
160029,hsa-miR-21-3p,CAACACCAGUCGAUGGGCUGU,51062,MAKNRRDRNSWGGFSEKTYEWSSEEEEPVKKAGPVQVLIVKDDHSF...,0
160030,hsa-miR-548ac,CAAAAACCGGCAAUUACUUUUG,5534,MGNEASYPLEMCSHFDADEIKRLGKRFKKLDLDNSGSLSVEEFMSL...,1
160031,mmu-miR-34b-5p,AGGCAGUGUAAUUAGCUGAUUGU,53602,MGKQNSKLRPEVLQDLREHTEFTDHELQEWYKGFLKDCPTGHLTVD...,1


Y = 0 for negative samples, and Y = 1 for positive samples.

In [10]:
len(neg_data_split['train']), len(neg_data_split['train'][neg_data_split['train'].Y == 1])

(560115, 280042)
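The same balance check can be run on every split at once. The sketch below uses a hypothetical helper, `positive_fraction`, and a toy dictionary in place of `neg_data_split` so that it runs without downloading the data; with the real data one would pass `neg_data_split` instead.

```python
import pandas as pd


def positive_fraction(df: pd.DataFrame, label_col: str = "Y") -> float:
    """Return the share of rows whose label equals 1."""
    return float((df[label_col] == 1).mean())


# Toy stand-in for neg_data_split (labels are invented).
toy_split = {
    "train": pd.DataFrame({"Y": [1, 0, 1, 0]}),
    "valid": pd.DataFrame({"Y": [1, 0, 0]}),
}

# One fraction per split, keyed by split name.
balance = {name: positive_fraction(df) for name, df in toy_split.items()}
print(balance)
```

For the training counts shown above, 280042 / 560115 is roughly 0.49997.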

The splits are nearly balanced between positive and negative samples: the training set has 560,115 samples, of which 280,042 are positive. This should be enough to train a fairly large model. Other miRNA-target interaction databases exist online, although further investigation is needed to determine how much they overlap with, and where they disagree with, miRTarBase.
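One way to measure overlap with another database is an outer merge on the (miRNA, target) pair. This is only a sketch: it assumes the other database can be brought into a dataframe with the same `miRNA_ID` and `Target_ID` columns, which may require ID mapping in practice. The helper `pair_overlap` and both toy dataframes are hypothetical.

```python
import pandas as pd

KEYS = ["miRNA_ID", "Target_ID"]


def pair_overlap(left: pd.DataFrame, right: pd.DataFrame) -> pd.Series:
    """Count unique pairs found only in left, only in right, or in both."""
    merged = left[KEYS].drop_duplicates().merge(
        right[KEYS].drop_duplicates(), on=KEYS, how="outer", indicator=True
    )
    # The "_merge" column holds "left_only", "right_only", or "both".
    return merged["_merge"].value_counts()


# Toy stand-ins with invented IDs; the second frame plays the other database.
mirtarbase_toy = pd.DataFrame(
    {"miRNA_ID": ["mir-a", "mir-b", "mir-b"], "Target_ID": [1, 2, 2]}
)
other_db_toy = pd.DataFrame(
    {"miRNA_ID": ["mir-a", "mir-c"], "Target_ID": [1, 3]}
)

counts = pair_overlap(mirtarbase_toy, other_db_toy)
print(counts)
```

Pairs counted under `both` appear in each source; comparing their labels would then show where the two sources disagree.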

Our goal is to predict whether an arbitrary miRNA interacts with an arbitrary protein, given the miRNA and protein sequences. This would support the development of miRNA therapeutics that target a desired protein, allow for experimental manipulations to better understand disease mechanisms, or enable the discovery of disease biomarkers. TDC also hosts a competition for performance on this dataset, which gives us an interesting opportunity to benchmark our own performance against known good models.
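Since the model only sees the two sequences, they first have to be turned into numeric features. As one possible starting point (an assumption on our part, not a method prescribed by TDC or miRTarBase), the sketch below counts k-mers over a fixed alphabet. The helper `kmer_counts` is hypothetical; the example sequence is the first miRNA in the training table above.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_counts(seq: str, alphabet: str, k: int) -> List[int]:
    """Count every length-k substring of seq, in a fixed alphabet order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # product() enumerates all k-mers, so the vector length is len(alphabet)**k.
    return ["".join(p) in counts and counts["".join(p)] or 0
            for p in product(alphabet, repeat=k)]


# 2-mer counts for a miRNA sequence over the RNA alphabet.
mirna_features = kmer_counts("UGUGUUCUCAGGUCACCCCUG", "ACGU", 2)
print(len(mirna_features), mirna_features)
```

The same function could be applied to the protein sequences with the 20-letter amino acid alphabet, although the vector length grows as 20 to the power k, so small k would be needed there.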

While the dataset gives binary labels for interactions, a potential extension would be to predict an activity score for each interaction. A yes/no label can be misleading, because the strength of an interaction and its effect span a spectrum from strong interaction to none. This goal would likely benefit from additional datasets with more detailed interaction data.