## Analyze your relation file

In this Jupyter notebook, we build a graph from your relation file and compute some basic metrics on it, such as the number of nodes, the number of edges, and the number of subgraphs. These metrics help you judge whether your relation file is suitable for training. If the graph splits into many subgraphs and none of them is large enough (e.g. no single subgraph contains more than 90% of the graph's nodes and edges), you may need to add more relations to your relation file.

In our opinion, the number of subgraphs should be as small as possible, and the number of nodes and edges in a subgraph should be as large as possible. In this way, the model can learn more information from the graph.
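To make the criterion concrete, here is a minimal sketch that counts subgraphs and measures how much of the graph the largest one covers. It assumes that "subgraph" means a connected component of the undirected graph, and it works on a plain list of edge pairs rather than on the graph object used later in this notebook. The function name `component_stats` and the toy node ids are made up for illustration.

```python
from collections import Counter


def component_stats(edges):
    """Return (num_components, node_share, edge_share) for an undirected edge list.

    node_share and edge_share are the fractions of all nodes and edges that
    fall inside the largest connected component (largest by node count).
    Nodes that never appear in an edge are not counted.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    node_counts = Counter(find(node) for node in list(parent))
    edge_counts = Counter(find(u) for u, _ in edges)
    largest = max(node_counts, key=node_counts.get)
    return (
        len(node_counts),
        node_counts[largest] / len(parent),
        edge_counts[largest] / len(edges),
    )


# Toy edge list (made-up ids): one 4-node component and one 2-node component.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("X", "Y")]
num_components, node_share, edge_share = component_stats(toy_edges)
print(num_components, round(node_share, 2), round(edge_share, 2))
```

In this toy graph the largest component holds 4 of 6 nodes and 4 of 5 edges, so it would fall short of the 90% guideline mentioned above.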

## Prepare your relation file

Prepare your relation file and specify its path in the following cell. The relation file should be a CSV/TSV file whose first line is the header. For the format of the entity and relation files, please refer to the [README](../graph_data/README.md). If you want to build your own entity and relation files, please refer to the [KG README](../graph_data/KG_README.md) for more details.

We assume that the relation file is named `knowledge_graph.tsv` and the entity file is named `annotated_entities.tsv`, and that both are located under the `datasets` directory. If they are stored elsewhere, you can specify the path in the following cell.

In [None]:
import os

datadir = os.path.join(os.path.dirname(os.getcwd()), "datasets", "rapex-v20240127")

In [1]:
relation_file = os.path.join(datadir, "knowledge_graph.tsv")
entitie_file = os.path.join(datadir, "annotated_entities.tsv")

if not os.path.exists(relation_file):
    raise FileNotFoundError("Relation file not found: {}".format(relation_file))

if not os.path.exists(entitie_file):
    raise FileNotFoundError("Entity file not found: {}".format(entitie_file))
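Since the first line of each file is expected to be a header, it can help to look at it before building the graph. The sketch below uses only the standard `csv` module. The helper name `read_header` and the placeholder column names in the demo file are hypothetical; the real column names are described in the README linked above.

```python
import csv
import os
import tempfile


def read_header(path, delimiter="\t"):
    """Return the first row of a delimited text file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        # next() with a default returns [] for an empty file instead of raising.
        return next(csv.reader(handle, delimiter=delimiter), [])


# Demo on a throwaway file; these column names are placeholders, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("col_a\tcol_b\tcol_c\n1\t2\t3\n")
    demo_header = read_header(demo_path)

print(demo_header)
```

In the notebook, the same helper could be called on `relation_file` and `entitie_file` once the cell above has run.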

## Dependencies

All related functions are defined in the `lib/graph.py` module, which we need to import before doing the graph analysis. In addition, we assume that you have followed the instructions in the [README](../README.md) file and installed all the required dependencies.

In [2]:
import os
import sys

libdir = os.path.join(os.path.dirname(os.getcwd()), "lib")
sys.path.append(libdir)

from graph import (
    get_num_nodes,
    get_num_edges,
    get_num_subgraphs,
    create_graph,
    get_subgraph,
)

## Build an undirected and a directed graph from the data

In [3]:
G = create_graph(relation_file, entity_file=entitie_file, directed=False, allow_multiple_edges=True)
directed_G = create_graph(relation_file, entity_file=entitie_file, directed=True, allow_multiple_edges=True)

## How many nodes, edges, and subgraphs are there in the graph?

In [4]:
get_num_nodes(G), get_num_edges(G), get_num_subgraphs(G)

(70271, 5811734, 69)

In [5]:
get_num_nodes(directed_G), get_num_edges(directed_G), get_num_subgraphs(directed_G)

(70271, 5811734, 69)

## How many nodes and edges are in the subgraph that contains our target node?

In [7]:
# We assume that our target node is ME/CFS; its node id is MONDO:0005404 (see entities.tsv) and its node type is Disease.
disease = ("MONDO:0005404", "Disease")

subgraph = get_subgraph(G, start_node=disease)

get_num_nodes(subgraph), get_num_edges(subgraph)

(70132, 5811606)
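To compare this result with the 90% guideline from the introduction, the counts can be turned into shares of the whole graph. The snippet below simply copies the numbers from the cell outputs in this notebook; the variable names are chosen for illustration.

```python
# Counts copied from the cell outputs in this notebook.
total_nodes, total_edges = 70271, 5811734  # whole graph
sub_nodes, sub_edges = 70132, 5811606      # subgraph containing the target node

node_share = sub_nodes / total_nodes
edge_share = sub_edges / total_edges

print(f"nodes: {node_share:.2%}, edges: {edge_share:.3%}")
print(
    "outside the subgraph:",
    total_nodes - sub_nodes, "nodes,",
    total_edges - sub_edges, "edges",
)
```

Both shares are well above 90%, so only 139 nodes and 128 edges lie outside the subgraph that contains the target node.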