## Analyze your relation file

In this Jupyter notebook, we build a graph from your relation file and compute some basic metrics on it, such as the number of nodes, the number of edges, and the number of subgraphs. These metrics indicate whether your relation file is suitable for training. If the graph splits into many subgraphs and none of them is large enough (for example, no single subgraph contains at least 90% of all nodes and edges in the graph), consider adding more relations to your relation file.

In our opinion, the number of subgraphs should be as small as possible, and each subgraph should contain as many nodes and edges as possible, so that the model can learn more information from the graph.
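The 90% rule of thumb above can be checked mechanically. The following is a minimal sketch, not the code in `lib/graph.py`: it treats the relations as a plain list of `(source, target)` pairs, finds connected components with a breadth-first search, and reports the share of nodes and edges held by the largest one. The function name and the toy edge list are hypothetical, and the sketch uses only the standard library.

```python
from collections import defaultdict, deque


def largest_component_share(edges):
    """Return (node_share, edge_share) of the largest connected component.

    `edges` is an iterable of (source, target) pairs; direction is ignored.
    """
    edges = list(edges)
    if not edges:
        raise ValueError("at least one edge is required")

    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    # Label every node with the index of its component using BFS.
    component_of = {}
    component_sizes = []
    for start in adjacency:
        if start in component_of:
            continue
        index = len(component_sizes)
        component_of[start] = index
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in component_of:
                    component_of[neighbor] = index
                    queue.append(neighbor)
        component_sizes.append(size)

    # "Largest" here means the component with the most nodes.
    largest = max(range(len(component_sizes)), key=component_sizes.__getitem__)
    edges_in_largest = sum(1 for source, _ in edges if component_of[source] == largest)
    return component_sizes[largest] / len(adjacency), edges_in_largest / len(edges)


# Toy graph: one component with 4 nodes and 4 edges, one with 2 nodes and 1 edge.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("X", "Y")]
node_share, edge_share = largest_component_share(toy_edges)
needs_more_relations = node_share < 0.9 or edge_share < 0.9
print(node_share, edge_share, needs_more_relations)
```

In this toy graph the largest component holds 4 of 6 nodes and 4 of 5 edges, so it falls short of the 90% threshold and more relations would be needed.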

## Prepare your relation file

Prepare your relation file and specify its path in the following cell. The relation file should be a CSV/TSV file whose first line is the header. For the format of the entity and relation files, please refer to the [README](../graph_data/README.md). If you want to build your own entity and relation files, please refer to the [KG README](../graph_data/KG_README.md) for more details.

We assume that the relation file is named `knowledge_graph.tsv` and the entity file is named `annotated_entities.tsv`, and that both are located under the `datasets` directory (the cell below uses `datasets/biomedgps-v2`). If your files are elsewhere, specify the path in the following cell.
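Since the first line of each file is expected to be a header, it can help to look at it before building the graph. The sketch below is only an illustration: `read_header` is a hypothetical helper, and the column names in the demo file are placeholders, not the schema described in the README.

```python
import csv
import os
import tempfile


def read_header(path):
    """Return the column names found on the first line of a CSV/TSV file."""
    # Pick the delimiter from the extension; anything but .tsv is read as CSV.
    delimiter = "\t" if path.endswith(".tsv") else ","
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter=delimiter))


# Demo on a throwaway file. These column names are placeholders, not the
# required schema; see the README for the real one.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_relations.tsv")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("source_id\ttarget_id\trelation_type\n")
        handle.write("A\tB\texample_relation\n")
    columns = read_header(demo_path)

print(columns)
```

In the notebook, the same helper could be called as `read_header(relation_file)` once the paths are defined.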

In [1]:
import os

datadir = os.path.join(os.path.dirname(os.getcwd()), "datasets", "biomedgps-v2")

In [2]:
relation_file = os.path.join(datadir, "knowledge_graph.tsv")
entitie_file = os.path.join(datadir, "annotated_entities.tsv")

if not os.path.exists(relation_file):
    raise FileNotFoundError("Relation file not found: {}".format(relation_file))

if not os.path.exists(entitie_file):
    raise FileNotFoundError("Entity file not found: {}".format(entitie_file))

## Dependencies

All the functions used here are defined in the `lib/graph.py` module, which we need to import before running the graph analysis. We also assume that you have followed the instructions in the [README](../README.md) and installed all the required dependencies.

In [3]:
import os
import sys

libdir = os.path.join(os.path.dirname(os.getcwd()), "lib")
sys.path.append(libdir)

from graph import (
    get_num_nodes,
    get_num_edges,
    get_num_subgraphs,
    create_graph,
    get_subgraph,
)

## Build undirected and directed graphs from the data

In [4]:
G = create_graph(relation_file, entity_file=entitie_file, directed=False, allow_multiple_edges=True)
directed_G = create_graph(relation_file, entity_file=entitie_file, directed=True, allow_multiple_edges=True)

### How many nodes, edges, and subgraphs are there in the graph?

In [5]:
get_num_nodes(G), get_num_edges(G), get_num_subgraphs(G)

(69909, 5810212, 66)

In [6]:
get_num_nodes(directed_G), get_num_edges(directed_G), get_num_subgraphs(directed_G)

(69909, 5810212, 66)

### How many nodes and edges are in the subgraph reachable from our target node?

In [7]:
# We assume that our target node is ME/CFS, the node id is MONDO:0005404 (see entities.tsv) and the node type is Disease.
disease = ("MONDO:0005404", "Disease")

subgraph = get_subgraph(G, start_node=disease)

get_num_nodes(subgraph), get_num_edges(subgraph)

(69776, 5810089)
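To relate these counts to the 90% rule of thumb from the introduction, the shares can be computed directly from the outputs of cells 5 and 7. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Counts copied from the outputs of cells 5 and 7 above.
total_nodes, total_edges = 69909, 5810212
subgraph_nodes, subgraph_edges = 69776, 5810089

node_share = subgraph_nodes / total_nodes
edge_share = subgraph_edges / total_edges

print(f"nodes: {node_share:.3%}, edges: {edge_share:.3%}")
```

Both shares come out above 99%, so by the criterion in the introduction the subgraph containing the target node covers almost the whole graph.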

### Distribution of Relationship Types in the Graph

In [38]:
from collections import Counter
import pandas as pd
import plotly.express as px

relation_types = [data["relation"] for u, v, data in G.edges(data=True)]
relation_counts = Counter(relation_types)

relation_type_df = pd.DataFrame.from_dict(relation_counts, orient="index").reset_index()
relation_type_df.columns = ["Relationship Type", "Count"]

relation_type_df

| | Relationship Type | Count |
|---|---|---|
| 0 | DGIDB::OTHER::Gene:Compound | 9519 |
| 1 | DGIDB::INHIBITOR::Gene:Compound | 3913 |
| 2 | DRUGBANK::target::Compound:Gene | 14479 |
| 3 | bioarx::DrugHumGen:Compound:Gene | 18955 |
| 4 | Hetionet::CbG::Compound:Gene | 11571 |
| ... | ... | ... |
| 107 | PrimeKG::parent-child:Disease:Disease | 7 |
| 108 | increased_by | 6 |
| 109 | inhibited_by | 1 |
| 110 | PrimeKG::expression_present:Gene:Anatomy | 1 |
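Many relation types in the output above carry a source prefix separated by `::` (for example `DGIDB::OTHER::Gene:Compound`), while a few such as `increased_by` have none. The sketch below groups counts by that prefix. It uses only the rows visible in the truncated output, so the totals are partial; the helper `source_of` and the `(no prefix)` label are hypothetical.

```python
from collections import Counter

# Rows visible in the truncated output above; the full table continues to index 110.
visible_counts = {
    "DGIDB::OTHER::Gene:Compound": 9519,
    "DGIDB::INHIBITOR::Gene:Compound": 3913,
    "DRUGBANK::target::Compound:Gene": 14479,
    "bioarx::DrugHumGen:Compound:Gene": 18955,
    "Hetionet::CbG::Compound:Gene": 11571,
    "PrimeKG::parent-child:Disease:Disease": 7,
    "increased_by": 6,
    "inhibited_by": 1,
    "PrimeKG::expression_present:Gene:Anatomy": 1,
}


def source_of(relation_type):
    """Return the text before the first '::', or a placeholder if there is none."""
    prefix, separator, _ = relation_type.partition("::")
    return prefix if separator else "(no prefix)"


counts_by_source = Counter()
for relation_type, count in visible_counts.items():
    counts_by_source[source_of(relation_type)] += count

for source, count in counts_by_source.most_common():
    print(source, count)
```

In the notebook, the same loop could run over `relation_counts.items()` to cover every relation type. Relation types without a prefix may be worth a closer look, since they do not follow the naming pattern of the others.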


In [40]:
fig = px.bar(
    relation_type_df,
    x="Relationship Type",
    y="Count",
    title="Distribution of Relationship Types in the Graph",
)

fig.show()

### Distribution of Entities in the Graph

In [41]:
from collections import Counter
import pandas as pd

entities = [G.nodes[n]["node_type"] for n in G.nodes]
entity_counts = Counter(entities)

entity_df = pd.DataFrame.from_dict(entity_counts, orient="index").reset_index()
entity_df.columns = ["Entity Type", "Count"]

entity_df

| | Entity Type | Count |
|---|---|---|
| 0 | Gene | 25775 |
| 1 | Compound | 15689 |
| 2 | PharmacologicClass | 345 |
| 3 | Disease | 5528 |
| 4 | Anatomy | 407 |
| 5 | BiologicalProcess | 11396 |
| 6 | CellularComponent | 1395 |
| 7 | MolecularFunction | 2885 |
| 8 | Pathway | 316 |
| 9 | Symptom | 460 |


In [42]:
import plotly.express as px

fig = px.bar(
    entity_df,
    x="Entity Type",
    y="Count",
    title="Distribution of Entity Types in the Graph",
)

fig.show()

### Distribution of the number of edges of each node

In [43]:
import pandas as pd
import networkx as nx
import math

degree_sequence = dict(G.degree())
node_names = nx.get_node_attributes(G, "name")
degree_data = [
    (f"{n}-{node_names.get(n).values[0]}", degree_sequence[n], n[1])
    for n in G.nodes
]

grouped_entity_df = pd.DataFrame(degree_data, columns=["Node Name", "Degree", "Node Type"])
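# Find the maximum value of the Degree column
max_degree = grouped_entity_df["Degree"].max()

# Define bins and labels
step = 100  # Width of each bin
# Extend the last edge past max_degree: with right=False the top edge is excluded,
# so a maximum that is an exact multiple of step would otherwise fall outside every bin.
bins = list(range(0, int(math.floor(max_degree / step)) * step + 2 * step, step))
labels = [f"{bins[i]}-{bins[i + 1] - 1}" for i in range(len(bins) - 1)]
labels[-1] = f"{bins[-2]}+"  # The last label is open-ended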

grouped_entity_df["Category"] = pd.cut(
    grouped_entity_df["Degree"], bins=bins, labels=labels, right=False
)

grouped_entity_df

| | Node Name | Degree | Node Type | Category |
|---|---|---|---|---|
| 0 | ('ENTREZ:2261', 'Gene')-FGFR3 | 1192 | Gene | 1100-1199 |
| 1 | ('MESH:C113580', 'Compound')-U 0126 | 71 | Compound | 0-99 |
| 2 | ('ENTREZ:2776', 'Gene')-GNAQ | 1672 | Gene | 1600-1699 |
| 3 | ('ENTREZ:5290', 'Gene')-PIK3CA | 2688 | Gene | 2600-2699 |
| 4 | ('ENTREZ:5728', 'Gene')-PTEN | 2361 | Gene | 2300-2399 |
| ... | ... | ... | ... | ... |
| 69904 | ('HMDB:HMDB0001487', 'Metabolite')-NADH | 1 | Metabolite | 0-99 |
| 69905 | ('GO:0006955', 'BiologicalProcess')-immune res... | 1 | BiologicalProcess | 0-99 |
| 69906 | ('MESH:D018489', 'Symptom')-Space Motion Sickness | 11 | Symptom | 0-99 |
| 69907 | ('MESH:D055958', 'Symptom')-Piriformis Muscle ... | 6 | Symptom | 0-99 |
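The `Category` column comes from left-closed bins (`right=False`), so a degree of 100 lands in `100-199` rather than `0-99`, and the top bin is relabelled as an open-ended range. The plain-Python sketch below follows the same rule on a few toy degrees. The helper `degree_category` is hypothetical and is not a replacement for `pd.cut`.

```python
def degree_category(degree, max_degree, step=100):
    """Label the left-closed bin [k*step, (k+1)*step) that contains `degree`."""
    # First value of the open-ended top bin, i.e. the bin that holds max_degree.
    last_start = (max_degree // step) * step
    start = (degree // step) * step
    if start >= last_start:
        return f"{last_start}+"
    return f"{start}-{start + step - 1}"


toy_degrees = [1, 71, 99, 100, 250]
categories = [degree_category(d, max(toy_degrees)) for d in toy_degrees]

for degree, category in zip(toy_degrees, categories):
    print(degree, category)
```

Here 99 and 100 end up in different categories, and 250, the largest toy degree, falls in the open-ended `200+` bin.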


In [44]:
import plotly.express as px

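# Note: because y is set, px.histogram sums the Degree values in each bin
# (histfunc defaults to "sum" when y is given); omit y to count nodes per bin instead.
fig = px.histogram(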
    grouped_entity_df,
    x="Category",
    y="Degree",
    title="Node Degree Distribution",
    category_orders={"Category": labels},
)
fig.show()

In [45]:
import plotly.express as px

fig = px.histogram(
    grouped_entity_df,
    x="Category",
    y="Degree",
    color="Node Type",
    title="Node Degree Distribution by Node Type",
    category_orders={"Category": labels},
    barmode="group",  # Use grouped bars
)
fig.show()