This notebook downloads the LncTarD 2.0 dataset and converts it into a list of tuples $(h,r,t)$ (head, relation, tail). The list is stored in `data/dataset.tsv`. Next to it, we store lookup tables for the types of targets and regulators (`data/target2type.tsv` and `data/regulator2type.tsv`, respectively).

In [1]:
from dataset import load_lnctard, df2nx
import pandas as pd

### 🛒  Download and load dataset

In [2]:
!mkdir -p data
!wget https://lnctard.bio-database.com/downloadfile/lnctard2.0.zip -qO- | zcat > data/lnctard2.0.txt

Load the raw dataset. For some reason, decoding the file as `utf-8` fails, but `latin-1` seems to work.

In [2]:
df = load_lnctard()
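
The decoding behaviour can be reproduced on a few bytes. The sketch below is illustrative only: it does not show what `load_lnctard` does internally, and the sample bytes and the `try_decode` helper are made up. `latin-1` maps every byte value to a character, so decoding never raises, whereas `utf-8` rejects byte sequences that are not valid UTF-8. The flip side is that `latin-1` can silently yield wrong characters if the file actually uses a different encoding.

```python
# Hypothetical sample: a tab-separated line containing the byte 0xB5
# (the micro sign in latin-1). It is not taken from the real file.
raw = b"NEAT1\tsome \xb5 note\n"

def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

utf8_text = try_decode(raw, "utf-8")      # None: a lone 0xB5 is not valid UTF-8
latin1_text = try_decode(raw, "latin-1")  # succeeds: every byte maps to a character
print(utf8_text, repr(latin1_text))
```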

In [3]:
df.head()

Unnamed: 0,Regulator,Target,SearchregulatoryMechanism,RegulatorType,TargetType
0,LINC00313,miR-4429,ceRNA or sponge,lncRNA,miRNA
1,FAM83H-AS1,CDKN1A,epigenetic regulation,lncRNA,PCG
2,NEAT1,TGFB1,ceRNA or sponge,lncRNA,PCG
3,NEAT1,ZEB1,ceRNA or sponge,lncRNA,TF
4,ZFPM2-AS1,MIF,interact with protein,lncRNA,PCG


Extract the largest connected component of the graph.

In [5]:
largest_cc = df2nx(
  df, head="Regulator", tail="Target",
  relation="SearchregulatoryMechanism",
  cc_mode="largest",
)
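
What "largest connected component" means can be sketched without any graph library. The code below is a plain-Python illustration, not the implementation of `df2nx`. It assumes edge direction is ignored when testing connectivity, and both the `largest_component` helper and the toy edges are made up.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the node set of the largest connected component.

    `edges` is an iterable of (head, relation, tail) triples; direction is
    ignored when testing connectivity (an assumption of this sketch).
    """
    adjacency = defaultdict(set)
    for head, _relation, tail in edges:
        adjacency[head].add(tail)
        adjacency[tail].add(head)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Made-up toy edges: one component with three nodes, one with two.
toy_edges = [
    ("A", "ceRNA or sponge", "B"),
    ("B", "epigenetic regulation", "C"),
    ("X", "interact with protein", "Y"),
]
nodes = largest_component(toy_edges)
# Keep only the edges whose endpoints both lie in the largest component.
kept = [e for e in toy_edges if e[0] in nodes and e[2] in nodes]
print(sorted(nodes), len(kept))
```

Restricting the data to one component means every entity in the resulting tuple list is reachable from every other one, at the cost of dropping the smaller, disconnected groups.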

### 🛍️ Extract tuples and store dataset

In [6]:
# create tuples (h,r,t)
edgedata = largest_cc.edges.data("SearchregulatoryMechanism")
tuples = [(h,r,t) for h,t,r in edgedata] # swizzle t and r
tuples = pd.DataFrame(tuples, columns=["head","relation","tail"])
print("gathered",len(tuples),"tuples")
tuples.head()

gathered 6773 tuples


Unnamed: 0,head,relation,tail
0,LINC00313,ceRNA or sponge,miR-4429
1,LINC00313,transcriptional regulation,SOX2
2,LINC00313,ceRNA or sponge,MIR422A
3,LINC00313,ceRNA or sponge,FOSL2
4,LINC00313,epigenetic regulation,ALX4


In [7]:
tuples.to_csv("data/dataset.tsv", sep="\t", index=False)
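
Relation names such as `ceRNA or sponge` contain spaces, so a tab is a safe field separator here. The round trip below is a minimal sketch using only the standard library and an in-memory buffer instead of `data/dataset.tsv`. The rows are copied from the preview above, and the snippet is an illustration rather than part of the notebook's pipeline.

```python
import csv
import io

# Example rows in (head, relation, tail) order, mirroring the columns above.
rows = [
    ("LINC00313", "ceRNA or sponge", "miR-4429"),
    ("LINC00313", "transcriptional regulation", "SOX2"),
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["head", "relation", "tail"])
writer.writerows(rows)

# Read the buffer back and separate the header from the data rows.
buffer.seek(0)
reader = csv.reader(buffer, delimiter="\t")
header = next(reader)
loaded = [tuple(row) for row in reader]
print(header, loaded == rows)
```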

Obtain the mappings from targets to target types and from regulators to regulator types.

In [8]:
target2type = df[["Target","TargetType"]].drop_duplicates(ignore_index=True)
target2type.head()

Unnamed: 0,Target,TargetType
0,miR-4429,miRNA
1,CDKN1A,PCG
2,TGFB1,PCG
3,ZEB1,TF
4,MIF,PCG


In [9]:
regulator2type = df[["Regulator","RegulatorType"]].drop_duplicates(ignore_index=True)
regulator2type.head()

Unnamed: 0,Regulator,RegulatorType
0,LINC00313,lncRNA
1,FAM83H-AS1,lncRNA
2,NEAT1,lncRNA
3,ZFPM2-AS1,lncRNA
4,SNHG1,lncRNA


Store the mappings.

In [10]:
target2type.to_csv("data/target2type.tsv", sep="\t", index=False)
regulator2type.to_csv("data/regulator2type.tsv", sep="\t", index=False)
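
Note that `drop_duplicates` removes repeated (name, type) pairs but does not guarantee one type per name: an entity listed with two different types would keep two rows in the lookup table. A minimal check is sketched below in plain Python on made-up pairs. The `ambiguous_names` helper is hypothetical and not part of the notebook, and whether such conflicts occur in the real data is not examined here.

```python
from collections import defaultdict

def ambiguous_names(pairs):
    """Map each name that has more than one distinct type to its sorted types."""
    types_by_name = defaultdict(set)
    for name, type_ in pairs:
        types_by_name[name].add(type_)
    return {
        name: sorted(types)
        for name, types in types_by_name.items()
        if len(types) > 1
    }

# Made-up pairs: "GENE1" appears with two different types.
pairs = [
    ("miR-4429", "miRNA"),
    ("CDKN1A", "PCG"),
    ("GENE1", "PCG"),
    ("GENE1", "TF"),
]
conflicts = ambiguous_names(pairs)
print(conflicts)
```

With the DataFrames from this notebook, the pairs could be obtained with `target2type.itertuples(index=False, name=None)`; an empty result would mean each name maps to a single type.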