# KG-Hub: Machine Learning on Knowledge Graphs

This walkthrough provides a basic introduction to preparing KG-Hub projects for graph-based machine learning and analysis. It assumes you have already set up a KG-Hub project and produced a merged graph. The graph should be in your project's `data/merged/` directory, named `merged-kg.tar.gz`, and in KGX TSV format.

If the merged graph is somewhere else, change the value for `merged_graph_path` below. Otherwise, just run that code block.

In [None]:
merged_graph_path = "../data/merged/merged-kg.tar.gz"

If you don't already have a graph and just want to dive in, run this next block. It will download a copy of the MONDO disease ontology graph from KG-OBO. This is not the most exciting input, but it's comparatively small and will still work in the following examples.

In [None]:
!wget https://kg-hub.berkeleybop.io/kg-obo/mondo/2022-02-04/mondo_kgx_tsv.tar.gz

In [None]:
merged_graph_path = "./mondo_kgx_tsv.tar.gz"

## Loading and processing graphs with GraPE

The [Graph Processing and Embedding (GraPE) package](https://github.com/AnacletoLAB/grape) is a comprehensive toolbox for loading, processing, describing, and otherwise learning from graphs. It has two primary components: Ensmallen, which handles graph processing, and Embiggen, which produces embeddings. Working with large, complex graphs can be very computationally intensive, so the GraPE tools use a variety of strategies to optimize efficiency. They also work very well with KG-Hub graphs!

[The full documentation for GraPE is here.](https://anacletolab.github.io/grape/index.html) You'll see that it offers a sizable collection of functions, so feel free to explore. There are also [tutorial notebooks](https://github.com/AnacletoLAB/grape/tree/main/tutorials) to peruse. For now, let's get GraPE ready, load a graph, and learn about its features.

First, install GraPE and its dependencies with `pip`:

In [None]:
%pip install grape -U

Every graph in Ensmallen is loaded as a `Graph` object, so we import that class (and `random`, because we'll use it later):

In [None]:
from ensmallen import Graph
import random

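Decompress the graph, as Ensmallen expects separate node and edge files. The `tar` command prints the name of each file it extracts. If your node and edge filenames differ from the values of `merged_node_filename` and `merged_edge_filename` below (they will if you downloaded the MONDO graph instead), change those values to match.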

In [None]:
!tar xvzf $merged_graph_path

In [None]:
merged_node_filename = "merged-kg_nodes.tsv"
merged_edge_filename = "merged-kg_edges.tsv"
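If you would rather not copy the filenames by hand, they can be read from the archive itself. The following is a minimal sketch, not part of KG-Hub or GraPE: `find_kgx_members` is a hypothetical helper, and it assumes the archive contains exactly one member ending in `_nodes.tsv` and one ending in `_edges.tsv`. The demo archive is built on the fly with placeholder contents.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def find_kgx_members(tar_path):
    """Return (node_file, edge_file) member names found in a KGX TSV tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = tar.getnames()
    # Assumed naming convention: "*_nodes.tsv" and "*_edges.tsv".
    nodes = [n for n in names if n.endswith("_nodes.tsv")]
    edges = [n for n in names if n.endswith("_edges.tsv")]
    if len(nodes) != 1 or len(edges) != 1:
        raise ValueError(f"Expected one node and one edge file, found: {names}")
    return nodes[0], edges[0]


# Demo on a tiny throwaway archive (placeholder contents, not real KG data).
with tempfile.TemporaryDirectory() as tmp:
    demo_tar = Path(tmp) / "demo_kgx_tsv.tar.gz"
    with tarfile.open(demo_tar, "w:gz") as tar:
        for name, text in [
            ("demo_nodes.tsv", "id\tcategory\n"),
            ("demo_edges.tsv", "subject\tpredicate\tobject\n"),
        ]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    demo_node_file, demo_edge_file = find_kgx_members(demo_tar)

print(demo_node_file, demo_edge_file)
```

With a real archive, the call would look like `merged_node_filename, merged_edge_filename = find_kgx_members(merged_graph_path)`.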

Load the graph with Ensmallen's `from_csv` (don't worry, we will tell it that these are TSV files, not CSV):

In [None]:
a_big_graph = Graph.from_csv(
    node_path=merged_node_filename,
    edge_path=merged_edge_filename,
    node_list_separator="\t",
    edge_list_separator="\t",
    node_list_header=True,  # Always true for KG-Hub KGs
    edge_list_header=True,  # Always true for KG-Hub KGs
    nodes_column='id',  # Always true for KG-Hub KGs
    node_list_node_types_column='category',  # Always true for KG-Hub KGs
    sources_column='subject',  # Always true for KG-Hub KGs
    destinations_column='object',  # Always true for KG-Hub KGs
    directed=False,
    name="A_Big_Graph",
    verbose=True
)

a_big_graph
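If loading fails because a column cannot be found, it may help to inspect the header rows directly. This is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the required column sets simply mirror the column names passed to `from_csv` above. The demo data is made up.

```python
import csv
import io

# Column names that the from_csv call refers to.
REQUIRED_NODE_COLUMNS = {"id", "category"}
REQUIRED_EDGE_COLUMNS = {"subject", "object"}


def missing_columns(tsv_handle, required):
    """Return the required column names absent from a TSV header row."""
    header = next(csv.reader(tsv_handle, delimiter="\t"), [])
    return sorted(required - set(header))


# Made-up headers: the edge header deliberately misspells "object".
demo_node_tsv = io.StringIO("id\tname\tcategory\nEX:1\texample\tbiolink:NamedThing\n")
demo_edge_tsv = io.StringIO("subject\tpredicate\tobj\n")

node_problems = missing_columns(demo_node_tsv, REQUIRED_NODE_COLUMNS)
edge_problems = missing_columns(demo_edge_tsv, REQUIRED_EDGE_COLUMNS)
print(node_problems, edge_problems)
```

For real files, pass an open file handle instead, for example `open(merged_node_filename, newline="")`.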

Great, now we've loaded a graph, and the report printed above gives a general idea of its contents.

We can retrieve the total count of connected nodes (i.e., the node count excluding all disconnected nodes):

In [None]:
a_big_graph.get_connected_nodes_number()

We can also retrieve a random array of nodes to work with:

In [None]:
# This will output a numpy array.
# Set random_state to a specific value to get the same result reproducibly
random_int = random.randint(10000,99999)
some_nodes = a_big_graph.get_random_nodes(number_of_nodes_to_sample=10, random_state=random_int)
some_nodes
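To illustrate what a fixed seed buys you, here is a small sketch using Python's own `random` module rather than Ensmallen. The list of integers is a stand-in for a graph's node IDs, and `sample_nodes` is a hypothetical helper: the same seed gives the same sample every time.

```python
import random

# Stand-in for a graph's integer node IDs.
demo_node_ids = list(range(1000))


def sample_nodes(seed, k=10):
    """Draw k distinct node IDs using a generator seeded with `seed`."""
    rng = random.Random(seed)
    return rng.sample(demo_node_ids, k)


first_draw = sample_nodes(seed=42)
second_draw = sample_nodes(seed=42)
print(first_draw == second_draw)
```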

The nodes are represented as integers for the sake of efficiency. If you'd prefer names, we can get those too:

In [None]:
all_node_names = []
for node_id in some_nodes:
    node_name = a_big_graph.get_node_name_from_node_id(node_id)
    all_node_names.append((node_id,node_name))
all_node_names

We can see how many neighbors each node has (i.e., its degree):

In [None]:
all_node_degrees = []
for node_id in some_nodes:
    node_degree = a_big_graph.get_node_degree_from_node_id(node_id)
    all_node_degrees.append((node_id,node_degree))
all_node_degrees
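For intuition, node degree in a simple undirected graph can be computed from an edge list by counting how often each node appears as an endpoint. This sketch uses a made-up four-node graph and a hypothetical helper, `undirected_degrees`. It does not try to reproduce how Ensmallen treats self-loops or repeated edges.

```python
from collections import Counter

# Made-up undirected edge list with no self-loops or duplicates.
demo_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def undirected_degrees(edge_list):
    """Count how many edges touch each node."""
    degrees = Counter()
    for source, destination in edge_list:
        degrees[source] += 1
        degrees[destination] += 1
    return degrees


demo_degrees = undirected_degrees(demo_edges)
print(dict(demo_degrees))
```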

We can also retrieve node types by node ID:

In [None]:
all_node_types = []
for node_id in some_nodes:
    one_node_type = a_big_graph.get_node_type_names_from_node_id(node_id)
    if one_node_type not in all_node_types:
        all_node_types.append(one_node_type)
all_node_types

Finally, let's complete a task in preparation for the next section: assembling holdout data and sets of negative edges. Ensmallen can handle both of these.

In [None]:
# Generate and save an 80/20 training/validation split of the edges in the input graph.
train_edge_path = merged_edge_filename + ".train"
valid_edge_path = merged_edge_filename + ".valid"

train_edge_graph, valid_edge_graph = a_big_graph.random_holdout(train_size=0.8)
train_edge_graph.dump_edges(train_edge_path, edges_type_column='predicate')
valid_edge_graph.dump_edges(valid_edge_path, edges_type_column='predicate')
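Conceptually, an 80/20 holdout shuffles the edges and cuts the list at the 80% mark. The sketch below shows that idea on a made-up edge list in plain Python. It is not Ensmallen's implementation, and `simple_edge_holdout` is a hypothetical helper.

```python
import random


def simple_edge_holdout(edge_list, train_size=0.8, seed=42):
    """Shuffle edges and split them into training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(edge_list)  # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_size))
    return shuffled[:cut], shuffled[cut:]


# Made-up chain of ten edges: EX:0-EX:1, EX:1-EX:2, ...
demo_edges = [(f"EX:{i}", f"EX:{i + 1}") for i in range(10)]
demo_train_edges, demo_valid_edges = simple_edge_holdout(demo_edges)
print(len(demo_train_edges), len(demo_valid_edges))
```

Every edge lands in exactly one of the two lists, so the validation edges are never seen during training.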

In [None]:
# Now the graph of negatives.
negative_graph = a_big_graph.sample_negatives(a_big_graph.get_edges_number()) # Just as many negative examples as positive examples
negative_graph = negative_graph.drop_disconnected_nodes()
negative_graph
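A negative edge is a pair of nodes that is not connected in the original graph. One simple way to produce such pairs, sketched here in plain Python, is rejection sampling: draw two distinct nodes at random and keep the pair only if it is not a known edge. This illustrates the idea and is not a description of how `sample_negatives` works internally. `sample_negative_edges` and the demo graph are hypothetical.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_negatives, seed=42):
    """Sample undirected node pairs that do not appear among the positives."""
    rng = random.Random(seed)
    # Store pairs as frozensets so (a, b) and (b, a) count as the same edge.
    positives = {frozenset(edge) for edge in positive_edges}
    negatives = set()
    attempts = 0
    max_attempts = 100 * n_negatives  # guard against looping forever
    while len(negatives) < n_negatives and attempts < max_attempts:
        attempts += 1
        source, destination = rng.sample(nodes, 2)  # distinct, so no self-loops
        pair = frozenset((source, destination))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Made-up graph: ten nodes joined in a chain by nine edges.
demo_nodes = [f"EX:{i}" for i in range(10)]
demo_positive_edges = [(f"EX:{i}", f"EX:{i + 1}") for i in range(9)]
demo_negative_edges = sample_negative_edges(
    demo_nodes, demo_positive_edges, n_negatives=len(demo_positive_edges)
)
print(demo_negative_edges)
```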

In [None]:
# As above, this will save training and validation edge lists.
neg_train_edge_path = merged_edge_filename + ".neg_train"
neg_valid_edge_path = merged_edge_filename + ".neg_valid"

neg_train_edge_graph, neg_valid_edge_graph = negative_graph.random_holdout(train_size=0.8)
neg_train_edge_graph.dump_edges(neg_train_edge_path, edges_type_column='predicate')
neg_valid_edge_graph.dump_edges(neg_valid_edge_path, edges_type_column='predicate')

## Generating embeddings and building classifiers with NEAT

The [NEAT](https://github.com/Knowledge-Graph-Hub/NEAT) package provides a way to define graph machine learning tasks with a single configuration file. We'll generate such a file here, then run NEAT to produce embeddings and a link prediction classifier.

We'll start by defining some basic parameters, largely based on what we did in the previous section.

In [None]:
# NEAT is not yet on PyPI, so we can't pip install it here.
# In the meantime, get it from GitHub with:
# git clone https://github.com/Knowledge-Graph-Hub/NEAT.git

In [None]:
directed = False # Yes, this is technically a directed network, but we'll treat it as undirected
node_path = merged_node_filename # Positive training nodes
edge_path = train_edge_path # Positive training edges
# valid_edge_path: already defined above
# neg_train_edge_path: already defined above
# neg_valid_edge_path: already defined above

# Embedding parameters
embedding_file_name = "embeddings.tsv"
embedding_history_file_name = "embedding_history.json"
node_embedding_method_name = "CBOW" # one of 'CBOW', 'GloVe', 'SkipGram', 'Siamese', 'TransE', 'SimplE', 'TransH', 'TransR'
walk_length = 10 # typically 100 or so
batch_size = 128 # typically 512 or more
window_size = 4
iterations = 5 # typically 20 or more

# Classifier parameters - NEAT can build multiple classifier types in one run, if specified in the configuration file
edge_method = "Average" # one of EdgeTransformer.methods: Hadamard, Sum, Average, L1, AbsoluteL1, L2, or alternatively a lambda
classifier_type = "Logistic Regression"
classifier_model_outfile = "model_lr"
classifier_model_type = "sklearn.linear_model.LogisticRegression"
classifier_model_random_state = 42
classifier_model_max_iter = 1000

# Output parameters
output_directory = "./"
config_filename = "scallops.yaml"
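The `edge_method` setting controls how two node embeddings are combined into a single feature vector for an edge, which is what the classifier is trained on. As a rough sketch, assuming "Average" means the element-wise mean of the two endpoint vectors and "Hadamard" their element-wise product, the operations look like this. The embeddings and function names below are made up for illustration.

```python
def average_edge_embedding(source_vec, destination_vec):
    """Element-wise mean of the two endpoint embeddings."""
    return [(a + b) / 2 for a, b in zip(source_vec, destination_vec)]


def hadamard_edge_embedding(source_vec, destination_vec):
    """Element-wise product of the two endpoint embeddings."""
    return [a * b for a, b in zip(source_vec, destination_vec)]


# Made-up three-dimensional node embeddings.
demo_embeddings = {
    "EX:1": [1.0, 2.0, 3.0],
    "EX:2": [3.0, 0.0, -1.0],
}

demo_average = average_edge_embedding(demo_embeddings["EX:1"], demo_embeddings["EX:2"])
demo_hadamard = hadamard_edge_embedding(demo_embeddings["EX:1"], demo_embeddings["EX:2"])
print(demo_average)   # [2.0, 1.0, 1.0]
print(demo_hadamard)  # [3.0, 0.0, -3.0]
```

Both operations are symmetric in their arguments, which suits a graph treated as undirected.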

In [None]:
outstring = f"""
graph_data:
  graph:
    directed: {directed}
    node_path: {node_path}
    edge_path: {edge_path}
    verbose: True
    nodes_column: 'id'
    node_list_node_types_column: 'category'
    default_node_type: 'biolink:NamedThing'
    sources_column: 'subject'
    destinations_column: 'object'
    default_edge_type: 'biolink:related_to'
  pos_validation:
    edge_path: {valid_edge_path}
  neg_training:
    edge_path: {neg_train_edge_path}
  neg_validation:
    edge_path: {neg_valid_edge_path}

embeddings:
  embedding_file_name: {embedding_file_name}
  embedding_history_file_name: {embedding_history_file_name}
  node_embedding_params:
    node_embedding_method_name: {node_embedding_method_name}
    walk_length: {walk_length}
    batch_size: {batch_size}
    window_size: {window_size}
    return_weight: 1.0
    explore_weight: 1.0
    iterations: {iterations}
    use_mirrored_strategy: False

  tsne:
    tsne_file_name: tsne.png

classifier:
  edge_method: {edge_method}
  classifiers:
    - type: {classifier_type}
      model:
        outfile: {classifier_model_outfile}
        type: {classifier_model_type}
        parameters:
          random_state: {classifier_model_random_state}
          max_iter: {classifier_model_max_iter}

output_directory: {output_directory}
"""
print(outstring)
with open(config_filename, "w") as outfile:
    outfile.write(outstring)
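Before launching a run, it can save time to confirm that every input file named in the configuration actually exists. This is a small standard-library sketch. `missing_inputs` is a hypothetical helper, and the demo uses temporary placeholder files rather than the real edge lists.

```python
import tempfile
from pathlib import Path


def missing_inputs(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(path) for path in paths if not Path(path).is_file()]


# Demo with one file that exists and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    demo_present = Path(tmp) / "edges.tsv.train"
    demo_present.write_text("subject\tobject\n")
    demo_absent = Path(tmp) / "edges.tsv.neg_valid"
    demo_absent_name = str(demo_absent)
    demo_missing = missing_inputs([demo_present, demo_absent])

print(demo_missing)
```

In the notebook, the list to check would be `[node_path, edge_path, valid_edge_path, neg_train_edge_path, neg_valid_edge_path]`.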

In [None]:
!neat run --config $config_filename

In [None]:
from IPython.display import Image
Image(filename='tsne.png')