## Analyze your relation file

In this Jupyter notebook, we build a graph from your relation file and compute some basic metrics on it, such as the number of nodes, edges, and subgraphs. These metrics help you judge whether the relation file is suitable for training. If the graph splits into many subgraphs and none of them is large enough (for example, the largest subgraph holds less than 90% of all nodes and edges in the graph), consider adding more relations to your relation file.

In our opinion, the number of subgraphs should be as small as possible, and each subgraph should contain as many nodes and edges as possible, so that the model can learn more information from the graph.
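
As a rough illustration of this criterion, the sketch below uses networkx directly (not the helpers from `lib/graph.py`) on a made-up toy graph to compute what share of the nodes and edges falls in the largest connected subgraph. The function name `largest_component_share` and the toy edges are assumptions for illustration only.

```python
import networkx as nx


def largest_component_share(graph):
    """Return (node_share, edge_share) of the largest connected component."""
    # Hypothetical helper, not part of lib/graph.py.
    largest_nodes = max(nx.connected_components(graph), key=len)
    largest = graph.subgraph(largest_nodes)
    node_share = largest.number_of_nodes() / graph.number_of_nodes()
    edge_share = largest.number_of_edges() / graph.number_of_edges()
    return node_share, edge_share


# Toy graph: one component with 4 nodes and 4 edges, one with 2 nodes and 1 edge.
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("x", "y")])

node_share, edge_share = largest_component_share(toy)
print(f"largest subgraph holds {node_share:.0%} of nodes and {edge_share:.0%} of edges")
```

With the 90% rule of thumb from above, this toy graph would count as too fragmented.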

## Prepare your relation file

Prepare your relation file and specify the path in the following cell. The relation file should be a csv/tsv file and the first line should be the header. For the format of the entity & relation file, please refer to the [README](../graph_data/README.md). If you want to build your own entity & relation file, please refer to the [KG README](../graph_data/KG_README.md) for more details.

We assume that the relation file is named `knowledge_graph.tsv`, that the entity file is named `knowledge_graph_entities.tsv`, and that both are located under the `datasets` directory. If they live elsewhere, specify the path in the following cell.
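
Before building the graph, it can help to confirm that the header contains the columns this notebook reads. The sketch below is an assumption-laden example: the column list is taken from the cells further down and may be incomplete (the README remains authoritative), and `missing_columns` is a hypothetical helper.

```python
import io

import pandas as pd

# Columns this notebook reads from the relation file; the README may require more.
EXPECTED_COLUMNS = ["source_id", "source_type", "target_id", "target_type", "relation_type"]


def missing_columns(tsv_source, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the TSV header."""
    # nrows=0 reads only the header line, so this is cheap even for large files.
    header = pd.read_csv(tsv_source, sep="\t", nrows=0).columns
    return [col for col in expected if col not in header]


# Made-up two-line file standing in for knowledge_graph.tsv.
sample = io.StringIO(
    "source_id\tsource_type\ttarget_id\ttarget_type\n"
    "EX:0001\tCompound\tEX:0002\tDisease\n"
)
missing = missing_columns(sample)
print(missing)
```

The same function accepts a file path in place of the `io.StringIO` object.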

In [36]:
import os

datadir = os.path.join(os.path.dirname(os.getcwd()), "datasets", "biomedgps-v20241115-134f92")
# datadir = "/var/folders/4s/d4nr1sg91ps1k3qz00h28w_r0000gp/T/tmpregc5oy4/"
graph_data_dir = os.path.join(os.path.dirname(os.getcwd()), "graph_data")

In [37]:
import subprocess
relation_file = os.path.join(datadir, "knowledge_graph.tsv")
entities_file = os.path.join(datadir, "knowledge_graph_entities.tsv")

print("Checking relation file: {}".format(relation_file))
if not os.path.exists(relation_file) and os.path.exists(relation_file + ".zip"):
    subprocess.check_output(["unzip", relation_file + ".zip", "-d", datadir])

if not os.path.exists(relation_file):
    raise FileNotFoundError("Relation file not found: {}".format(relation_file))

print("Checking entity file: {}".format(entities_file))
if not os.path.exists(entities_file) and os.path.exists(entities_file + ".zip"):
    subprocess.check_output(["unzip", entities_file + ".zip", "-d", datadir])

if not os.path.exists(entities_file):
    raise FileNotFoundError("Entity file not found: {}".format(entities_file))

Checking relation file: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/datasets/biomedgps-v20241115-134f92/knowledge_graph.tsv
Checking entity file: /Users/jy006/Documents/Code/BioMedGPS/biomedgps-data/datasets/biomedgps-v20241115-134f92/knowledge_graph_entities.tsv


## Dependencies

All related functions are defined in the `lib/graph.py` module, which we need to import before running the graph analysis. We assume that you have followed the instructions in the [README](../README.md) and installed all the required dependencies.

In [38]:
import os
import sys

libdir = os.path.join(os.path.dirname(os.getcwd()), "lib")
sys.path.append(libdir)

from graph import (
    get_num_nodes,
    get_num_edges,
    get_num_subgraphs,
    create_graph,
    get_subgraph,
)

## Build an undirected graph from the data

In [4]:
G = create_graph(
    relation_file,
    entity_file=entities_file,
    directed=False,
    allow_multiple_edges=True,
)
directed_G = create_graph(
    relation_file, entity_file=entities_file, directed=True, allow_multiple_edges=True
)

### How many nodes, edges, and subgraphs are there in the graph?

In [5]:
get_num_nodes(G), get_num_edges(G), get_num_subgraphs(G)

(146969, 13842104, 69)

In [6]:
get_num_nodes(directed_G), get_num_edges(directed_G), get_num_subgraphs(directed_G)

(146969, 13842104, 69)
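
The undirected and directed graphs report the same counts. One plausible reading, which is an assumption because `lib/graph.py` is not shown here, is that subgraphs of the directed graph are counted with edge direction ignored (weakly connected components). The toy example below illustrates that equivalence with plain networkx.

```python
import networkx as nx

edges = [("a", "b"), ("b", "c"), ("x", "y")]

undirected = nx.MultiGraph()
undirected.add_edges_from(edges)
directed = nx.MultiDiGraph()
directed.add_edges_from(edges)

# Connected components of the undirected graph ...
n_undirected = nx.number_connected_components(undirected)
# ... match the weakly connected components of the directed one.
n_weak = nx.number_weakly_connected_components(directed)
# Strongly connected components respect direction and are usually more numerous.
n_strong = nx.number_strongly_connected_components(directed)

print(n_undirected, n_weak, n_strong)
```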

### How many nodes and edges are in the subgraph that contains our target node?

In [7]:
# We assume that our target node is ME/CFS: the node id is MONDO:0005404 (see the entity file) and the node type is Disease.
disease = ("MONDO:0005404", "Disease")

subgraph = get_subgraph(G, start_node=disease)

get_num_nodes(subgraph), get_num_edges(subgraph)

(146808, 13841916)
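
For readers without access to `lib/graph.py`, the idea can be sketched with networkx alone. This is not the implementation of `get_subgraph`; it is a minimal example that assumes the subgraph is the connected component containing the start node, and that nodes are `(id, type)` tuples like the `disease` tuple above. All IDs are made up.

```python
import networkx as nx

toy = nx.MultiGraph()
# Nodes are (id, type) tuples, mirroring the `disease` tuple above.
toy.add_edge(("D:1", "Disease"), ("G:1", "Gene"))
toy.add_edge(("G:1", "Gene"), ("C:1", "Compound"))
toy.add_edge(("G:1", "Gene"), ("C:1", "Compound"))  # parallel edge, kept by MultiGraph
toy.add_edge(("D:2", "Disease"), ("G:2", "Gene"))  # separate island

start = ("D:1", "Disease")
reachable = nx.node_connected_component(toy, start)
component = toy.subgraph(reachable)

print(component.number_of_nodes(), component.number_of_edges())
```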

### Distribution of Relationship Types in the Graph

In [9]:
from collections import Counter
import pandas as pd
import plotly.express as px

knowledge_graph = pd.read_csv(relation_file, sep="\t", low_memory=False)
# relation_types = [data["relation"] for u, v, data in G.edges(data = True)]
# formatted_relation_types = [data["formatted_relation"] for u, v, data in G.edges(data = True)]
relation_types = knowledge_graph["relation_type"].values
formatted_relation_types = knowledge_graph["formatted_relation_type"].values if "formatted_relation_type" in knowledge_graph else knowledge_graph["relation_type"].values
relation_counts = Counter(relation_types)
formatted_relation_counts = Counter(formatted_relation_types)

relation_type_df = pd.DataFrame.from_dict(relation_counts, orient="index").reset_index()
relation_type_df.columns = ["Relationship Type", "Count"]

formatted_relation_type_df = pd.DataFrame.from_dict(
    formatted_relation_counts, orient="index"
).reset_index()
formatted_relation_type_df.columns = ["Formatted Relationship Type", "Count"]
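
The `Counter` plus `from_dict` steps above can also be written with pandas alone. The sketch below shows that alternative on a made-up three-row frame; `toy_kg` and `toy_counts` are illustrative names, not variables used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows standing in for knowledge_graph.tsv.
toy_kg = pd.DataFrame(
    {"relation_type": ["A::x::Gene:Gene", "A::x::Gene:Gene", "B::y::Gene:Disease"]}
)

toy_counts = (
    toy_kg["relation_type"]
    .value_counts()  # sorted by count, descending
    .rename_axis("Relationship Type")
    .reset_index(name="Count")
)
print(toy_counts)
```

One difference to keep in mind: `value_counts` returns the rows sorted by count, whereas the `Counter` approach keeps first-seen order.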

In [10]:
relation_type_df

Unnamed: 0,Relationship Type,Count
0,BioMedGPS::SideEffect::Compound:Disease,85590
1,BioMedGPS::SideEffect::Compound:Phenotype,85386
2,GNBR::Y::Gene:Disease,2149
3,GNBR::T::Compound:Disease,46226
4,GNBR::Pa::Compound:Disease,2191
...,...,...
155,PrimeKG::interacts_with::Gene:Pathway,42507
156,PrimeKG::parent-child::Pathway:Pathway,5060
157,PrimeKG::expression_present::Gene:Anatomy,1518060
158,PrimeKG::parent-child::Anatomy:Anatomy,27996


In [11]:
formatted_relation_type_df

Unnamed: 0,Formatted Relationship Type,Count
0,BioMedGPS::SideEffect::Compound:Disease,85590
1,BioMedGPS::SideEffect::Compound:Phenotype,85386
2,GNBR::Y::Gene:Disease,2149
3,GNBR::T::Compound:Disease,46226
4,GNBR::Pa::Compound:Disease,2191
...,...,...
155,PrimeKG::interacts_with::Gene:Pathway,42507
156,PrimeKG::parent-child::Pathway:Pathway,5060
157,PrimeKG::expression_present::Gene:Anatomy,1518060
158,PrimeKG::parent-child::Anatomy:Anatomy,27996


In [12]:
fig = px.bar(
    relation_type_df.sort_values("Count", ascending=False),
    x="Relationship Type",
    y="Count",
    title="Distribution of Relationship Types in the Graph",
)

fig.show()

In [13]:
fig = px.bar(
    formatted_relation_type_df.sort_values("Count", ascending=False),
    x="Formatted Relationship Type",
    y="Count",
    title="Distribution of Formatted Relationship Types in the Graph",
)

fig.show()

In [14]:
formatted_relation_type_df.sort_values("Count", ascending=False, inplace=True)
formatted_relation_type_df

Unnamed: 0,Formatted Relationship Type,Count
119,PrimeKG::synergistic_interaction::Compound:Com...,2612146
157,PrimeKG::expression_present::Gene:Anatomy,1518060
133,PrimeKG::expression_present::Anatomy:Gene,1518060
52,DRUGBANK::ddi-interactor-in::Compound:Compound,1355733
128,PrimeKG::ppi::Gene:Gene,642150
...,...,...
25,BioMedGPS::AssociatedWith::CellularComponent:D...,2
16,BioMedGPS::AssociatedWith::Disease:CellularCom...,2
30,DRUGBANK::treats::Metabolite:Disease,1
35,BioMedGPS::E-::Disease:Pathway,1


In [15]:
formatted_relation_type_df[
    formatted_relation_type_df["Formatted Relationship Type"].str.endswith("Compound:Compound")
].sort_values("Count", ascending=False)

Unnamed: 0,Formatted Relationship Type,Count
119,PrimeKG::synergistic_interaction::Compound:Com...,2612146
52,DRUGBANK::ddi-interactor-in::Compound:Compound,1355733
53,Hetionet::CrC::Compound:Compound,6455


In [16]:
formatted_relation_type_df[
    formatted_relation_type_df["Formatted Relationship Type"].str.contains("Anatomy")
].sort_values("Count", ascending=False)

Unnamed: 0,Formatted Relationship Type,Count
157,PrimeKG::expression_present::Gene:Anatomy,1518060
133,PrimeKG::expression_present::Anatomy:Gene,1518060
63,Hetionet::AeG::Anatomy:Gene,526407
64,Hetionet::AdG::Anatomy:Gene,102240
65,Hetionet::AuG::Anatomy:Gene,97848
158,PrimeKG::parent-child::Anatomy:Anatomy,27996
136,PrimeKG::expression_absent::Anatomy:Gene,19884
159,PrimeKG::expression_absent::Gene:Anatomy,19884
33,Hetionet::DlA::Disease:Anatomy,3638
34,Hetionet::Aa::Anatomy:Disease,1


In [17]:
formatted_relation_type_df[
    formatted_relation_type_df["Formatted Relationship Type"].str.match(".*?::Gene:(BiologicalProcess|CellularComponent|MolecularFunction|Pathway)")
].sort_values("Count", ascending=False)

Unnamed: 0,Formatted Relationship Type,Count
109,Hetionet::GpBP::Gene:BiologicalProcess,542255
140,PrimeKG::interacts_with::Gene:BiologicalProcess,142722
110,Hetionet::GpMF::Gene:MolecularFunction,94202
143,PrimeKG::interacts_with::Gene:CellularComponent,77445
111,Hetionet::GpCC::Gene:CellularComponent,71563
142,PrimeKG::interacts_with::Gene:MolecularFunction,68383
155,PrimeKG::interacts_with::Gene:Pathway,42507
118,Hetionet::GpPW::Gene:Pathway,26598


In [18]:
formatted_relation_type_df[
    formatted_relation_type_df["Formatted Relationship Type"].str.contains("Gene:Gene")
].sort_values("Count", ascending=False)

Unnamed: 0,Formatted Relationship Type,Count
128,PrimeKG::ppi::Gene:Gene,642150
58,STRING::REACTION::Gene:Gene,400426
60,STRING::CATALYSIS::Gene:Gene,343533
61,STRING::BINDING::Gene:Gene,315875
59,STRING::OTHER::Gene:Gene,310684
75,Hetionet::Gr>G::Gene:Gene,265672
82,Hetionet::GiG::Gene:Gene,147164
62,INTACT::PHYSICAL ASSOCIATION::Gene:Gene,129311
76,INTACT::ASSOCIATION::Gene:Gene,112369
83,STRING::ACTIVATION::Gene:Gene,81355


In [19]:
formatted_relation_type_df[
    formatted_relation_type_df["Formatted Relationship Type"].str.contains("Disease:Disease")
].sort_values("Count", ascending=False)

Unnamed: 0,Formatted Relationship Type,Count
148,PrimeKG::parent-child::Disease:Disease,219790
19,Hetionet::DrD::Disease:Disease,661


### Entities in different species

In [20]:
## Number of Mouse / Rat / Human Genes
entities = pd.read_csv(entities_file, sep="\t")
genes = entities[entities["label"] == "Gene"]
mouse_genes = genes[genes["taxid"] == 10090]
rat_genes = genes[genes["taxid"] == 10116]
human_genes = genes[genes["taxid"] == 9606]

print("Number of Entities: ", len(mouse_genes), len(rat_genes), len(human_genes))
knowledge_graph = pd.read_csv(relation_file, sep="\t")
mouse_relations = knowledge_graph[
    knowledge_graph["source_id"].isin(mouse_genes["id"])
    | knowledge_graph["target_id"].isin(mouse_genes["id"])
]

human_relations = knowledge_graph[
    knowledge_graph["source_id"].isin(human_genes["id"])
    | knowledge_graph["target_id"].isin(human_genes["id"])
]

len(mouse_relations), len(human_relations)

Number of Entities:  5324 0 27833



Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.



(40848, 8902862)
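
The per-species counts can also be produced in one pass. The following sketch assumes the same `label` and `taxid` columns as the cell above and runs on a made-up entity table; `SPECIES` and `per_species` are illustrative names.

```python
import pandas as pd

# NCBI Taxonomy IDs used in the cell above.
SPECIES = {9606: "human", 10090: "mouse", 10116: "rat"}

toy_entities = pd.DataFrame(
    {
        "id": ["g1", "g2", "g3", "d1"],
        "label": ["Gene", "Gene", "Gene", "Disease"],
        "taxid": [9606, 9606, 10090, None],
    }
)

toy_genes = toy_entities[toy_entities["label"] == "Gene"]
per_species = (
    toy_genes["taxid"]
    .dropna()  # genes without a taxid are skipped
    .astype(int)
    .map(SPECIES)
    .value_counts()
    .reindex(list(SPECIES.values()), fill_value=0)  # keep species with zero genes
)
print(per_species.to_dict())
```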

### Distribution of Entities in the Graph

In [22]:
from collections import Counter
import pandas as pd

source_entities = knowledge_graph[["source_id", "source_type"]].rename(
    columns={"source_id": "entity_id", "source_type": "entity_type"}
)
target_entities = knowledge_graph[["target_id", "target_type"]]
target_entities.columns = ["entity_id", "entity_type"]
entities = pd.concat([source_entities, target_entities], axis=0).drop_duplicates()
entity_counts = Counter(entities["entity_type"])

entity_df = pd.DataFrame.from_dict(entity_counts, orient="index").reset_index()
entity_df.columns = ["Entity Type", "Count"]

entity_df

Unnamed: 0,Entity Type,Count
0,Compound,15482
1,Gene,33232
2,Disease,28921
3,BiologicalProcess,27835
4,CellularComponent,4043
5,Metabolite,35
6,Pathway,3342
7,Anatomy,14015
8,PharmacologicClass,345
9,MolecularFunction,11083


In [23]:
import plotly.express as px

fig = px.bar(
    entity_df,
    x="Entity Type",
    y="Count",
    title="Distribution of Entity Types in the Graph",
)

fig.show()

### Distribution of the number of edges of each node

In [27]:
import pandas as pd
import networkx as nx
import math
import matplotlib.pyplot as plt

# Assume G is an existing graph object
degree_sequence = dict(G.degree())
node_names = nx.get_node_attributes(G, "name")

# Build a DataFrame with the node name, degree, and node type
degree_data = [
    (
        f"{n}-{node_names.get(n, 'Unknown')}",
        degree_sequence[n],
        G.nodes[n].get("type", "Unknown"),
    )
    for n in G.nodes
]

grouped_entity_df = pd.DataFrame(
    degree_data, columns=["Node Name", "Degree", "Node Type"]
)

# Find the maximum value of the Degree column
max_degree = grouped_entity_df["Degree"].max()

# Define the bins and labels
step = 100  # width of each bin
# Bins are half-open ([low, high)), so the last edge must lie strictly above max_degree
bins = list(range(0, (math.floor(max_degree / step) + 2) * step, step))
labels = [f"{bins[i]}-{bins[i + 1] - 1}" for i in range(len(bins) - 1)]
labels[-1] = f"{bins[-2]}+"  # the last label covers the top range

# Assign each Degree to a Category
grouped_entity_df["Category"] = pd.cut(
    grouped_entity_df["Degree"], bins=bins, labels=labels, right=False
)

# Group by Category and count the nodes in each group
category_counts = grouped_entity_df["Category"].value_counts().sort_index()

category_counts_df = category_counts.reset_index()
category_counts_df.columns = ["Category", "Node Count"]

category_counts_df
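
With `right=False`, `pd.cut` uses half-open intervals, so a value equal to the last bin edge is left unbinned. The bin list therefore has to extend past the maximum degree. The toy example below, with made-up degrees, shows the effect.

```python
import pandas as pd

step = 100
degrees = pd.Series([5, 99, 100, 250, 300])

# Half-open bins [0, 100), [100, 200), [200, 300): the value 300 has no bin.
bins = [0, 100, 200, 300]
binned = pd.cut(degrees, bins=bins, right=False)
print(binned.isna().tolist())

# Extending the bins by one step gives 300 a home in [300, 400).
wider = pd.cut(degrees, bins=bins + [bins[-1] + step], right=False)
print(wider.isna().tolist())
```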

In [31]:
degree_df = pd.DataFrame(degree_sequence.items(), columns=["Node Name", "Degree"])
degree_df["Node Name"] = degree_df["Node Name"].apply(lambda x: f"{x[0]}-{x[1]}")
degree_df

Unnamed: 0,Node Name,Degree
0,DrugBank:DB00277-Compound,4377
1,UMLS:C0000727-Disease,17
2,DrugBank:DB00289-Compound,6646
3,DrugBank:DB00370-Compound,5768
4,DrugBank:DB00472-Compound,7005
...,...,...
146964,WikiPathways:WP2876-Pathway,5
146965,WikiPathways:WP2877-Pathway,37
146966,WikiPathways:WP3295-Pathway,4
146967,WikiPathways:WP4-Pathway,1


In [None]:
# Make sure the group order matches the bins
category_counts_df["Category"] = pd.Categorical(
    category_counts_df["Category"], categories=labels, ordered=True
)
category_counts_df = category_counts_df.sort_values("Category")

# Draw the bar chart
fig = px.bar(
    category_counts_df,
    x="Category",
    y="Node Count",
    title="Node Count by Degree Group",
    labels={"Category": "Degree Group", "Node Count": "Number of Nodes"},
)

# Clearer layout settings
fig.update_layout(
    xaxis_title="Degree Group",
    yaxis_title="Number of Nodes",
    xaxis=dict(categoryorder="array", categoryarray=labels),  # enforce the bin order
    title=dict(x=0.5),  # center the title
    bargap=0.1,  # gap between bars
)

fig.show()

In [35]:
super_nodes = degree_df[degree_df["Degree"] > 6000]

copied_entities = pd.read_csv(entities_file, sep="\t", low_memory=False)
# Node name is a tuple of (id, label)
copied_entities["Node Name"] = copied_entities["id"] + "-" + copied_entities["label"]
annotated_super_nodes = pd.merge(super_nodes, copied_entities, on="Node Name")
annotated_super_nodes

Unnamed: 0,Node Name,Degree,id,label,name,description,resource,synonyms,pmids,taxid,xrefs,_merge
0,DrugBank:DB00289-Compound,6646,DrugBank:DB00289,Compound,ATOMOXETINE,A secondary amino compound having methyl and ...,DrugBank,(-)-Tomoxetine|Atomoxetina|Atomoxetine|Stratte...,15338851|23048018,,CHEBI:127342|CHEMBL:CHEMBL641|DrugBank:DB00289...,both
1,DrugBank:DB00472-Compound,7005,DrugBank:DB00472,Compound,FLUOXETINE,An aromatic ether consisting of 4-trifluoromet...,DrugBank,"(+-)-N-Methyl-3-phenyl-3-((alpha,alpha,alpha-t...",,,CHEBI:86990|CHEMBL:CHEMBL41|DrugBank:DB00472|M...,both
2,DrugBank:DB00313-Compound,6331,DrugBank:DB00313,Compound,VALPROIC ACID,A branched-chain saturated fatty acid that com...,DrugBank,2-PROPYL-PENTANOIC ACID|2-Propylpentanoic Acid...,11716839|12475192|15124690|15560954|15578701|1...,,CHEBI:39867|CHEMBL:CHEMBL109|DrugBank:DB00313|...,both
3,DrugBank:DB00333-Compound,7167,DrugBank:DB00333,Compound,METHADONE,,DrugBank,(+-)-Methadone|(+/-)-Methadone|(+/-)-Methadone...,,,CHEBI:167309|CHEBI:6807|CHEMBL:CHEMBL651|DrugB...,both
4,DrugBank:DB00458-Compound,8545,DrugBank:DB00458,Compound,IMIPRAMINE,"A dibenzoazepine that is 5H-dibenzo[b,f]azepin...",DrugBank,"""Antideprin""|""Berkomine""|""Censtim""|""Censtin""|""...",20825390,,CHEBI:47499|CHEMBL:CHEMBL11|DrugBank:DB00458|H...,both
...,...,...,...,...,...,...,...,...,...,...,...,...
250,UBERON:0013540-Anatomy,13562,UBERON:0013540,Anatomy,Brodmann (1909) area 9,.,UBERON,B09-9|BA9|Brodmann (1909) area 9|Brodmann area...,,,,both
251,GO:0005634-CellularComponent,10814,GO:0005634,CellularComponent,nucleus,A membrane-bounded organelle of eukaryotic cel...,GO,cell nucleus|horsetail nucleus,,,NIF_Subcellular:sao1702920020|Wikipedia:Cell_n...,both
252,UBERON:0000173-Anatomy,9784,UBERON:0000173,Anatomy,amniotic fluid,Amniotic fluid is a bodily fluid consisting of...,UBERON,acqua amnii|liquor amnii,,,,both
253,UBERON:0004801-Anatomy,8764,UBERON:0004801,Anatomy,cervix epithelium,An epithelium that is part of a uterine cervix...,UBERON,cervical canal epithelial tissue|cervical cana...,,,,both
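
The `_merge` column in the output above is what pandas adds when `merge` is called with `indicator=True`. Combined with `how="left"`, it can be used to spot super nodes that have no row in the entity file. The sketch below demonstrates this on made-up rows; `toy_super` and `toy_ents` are illustrative names.

```python
import pandas as pd

toy_super = pd.DataFrame(
    {"Node Name": ["c1-Compound", "c2-Compound"], "Degree": [7000, 6500]}
)
toy_ents = pd.DataFrame({"id": ["c1"], "label": ["Compound"], "name": ["EXAMPLE"]})
toy_ents["Node Name"] = toy_ents["id"] + "-" + toy_ents["label"]

# how="left" keeps every super node; indicator=True adds a `_merge` column.
annotated = pd.merge(toy_super, toy_ents, on="Node Name", how="left", indicator=True)
unmatched = annotated[annotated["_merge"] == "left_only"]
print(unmatched["Node Name"].tolist())
```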


### Node degree distribution by node type

In [111]:
import plotly.express as px

fig = px.histogram(
    grouped_entity_df,
    x="Category",
    y="Degree",
    color="Node Type",
    title="Node Degree Distribution by Node Type",
    category_orders={"Category": labels},
    barmode="group",  # use grouped bars
)
fig.show()
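
When `y` is given, Plotly Express histograms aggregate `y` by sum by default, as far as we know, so the bars above show summed degree per group, not the number of nodes. The pandas-only sketch below, on made-up rows, contrasts the two quantities; the column names mirror `grouped_entity_df`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Category": ["0-99", "0-99", "0-99", "100+"],
        "Node Type": ["Gene", "Gene", "Disease", "Gene"],
        "Degree": [10, 30, 5, 150],
    }
)

# Summed degree per group (what a histogram with y="Degree" would aggregate).
degree_sum = toy.groupby(["Category", "Node Type"])["Degree"].sum()

# Number of nodes per group, which is often what "distribution" is taken to mean.
node_count = pd.crosstab(toy["Category"], toy["Node Type"])

print(degree_sum.to_dict())
print(node_count)
```

If node counts are the goal, dropping `y="Degree"` from the `px.histogram` call should make it count rows instead.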