# Graph embedding using SkipGram

This notebook duplicates the other SkipGram notebook, except that here we embed the graph without CORD19 to investigate the effect of publications on the ranking of ChEMBL antivirals. Specifically, we want to know whether the ranking of antiviral drugs simply regurgitates the knowledge extracted from publications.

The SkipGram model predicts the context using the central word.
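
As a minimal sketch of what "predicting the context from the central word" means for a walk, the snippet below builds (center, context) pairs from a toy walk. The function name `skipgram_pairs` and the toy walk are illustrative assumptions, not part of embiggen.

```python
from typing import List, Tuple


def skipgram_pairs(walk: List[int], window_size: int) -> List[Tuple[int, List[int]]]:
    """Return (center, context) pairs for every position with a full window."""
    pairs = []
    # Only positions with window_size neighbours on both sides are kept.
    for i in range(window_size, len(walk) - window_size):
        left = walk[i - window_size:i]
        right = walk[i + 1:i + 1 + window_size]
        pairs.append((walk[i], left + right))
    return pairs


# Hypothetical walk over node ids 0..6.
toy_walk = [0, 1, 2, 3, 4, 5, 6]
pairs = skipgram_pairs(toy_walk, window_size=2)
for center, context in pairs:
    print(center, "->", context)
```

Each center node is the model input, and the surrounding nodes are the targets it is trained to predict.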

In our implementation, as for both the Binary SkipGram model and the CBOW model, the batches of walks are generated lazily, so the memory requirements are minimal and the method can scale to very large graphs. It can also run on graphs like Monarch (150M edges and 50M nodes), provided that you use a GPU ([or better still a TPU](https://cloud.google.com/ai-platform/training/docs/using-tpus#console)) that is able to fit an embedding model that big; that constraint comes only from the sheer number of nodes.
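
To illustrate why lazy generation keeps memory low, here is a minimal sketch using a plain Python generator over a toy adjacency list. It uses uniform random walks and hypothetical names (`lazy_walk_batches`, `toy_graph`); it is not how embiggen or ensmallen implement their walks.

```python
import random
from typing import Dict, Iterator, List


def lazy_walk_batches(
    adjacency: Dict[int, List[int]],
    walk_length: int,
    batch_size: int,
    seed: int = 42,
) -> Iterator[List[List[int]]]:
    """Yield batches of uniform random walks, building one batch at a time."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    for start in range(0, len(nodes), batch_size):
        batch = []
        for node in nodes[start:start + batch_size]:
            walk = [node]
            # Extend the walk until it is long enough or hits a trap node.
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            batch.append(walk)
        # Only the current batch is held in memory.
        yield batch


# Hypothetical toy graph given as an adjacency list.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
batches = lazy_walk_batches(toy_graph, walk_length=5, batch_size=2)
first_batch = next(batches)
print(first_batch)
```

Because the walks are produced on demand, the full set of walks over the graph never has to be stored at once.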

In [1]:
import sys
print(sys.path)

['/home/jtr4v/kg_covid_19_drug_analyses', '/home/jtr4v/anaconda3/lib/python38.zip', '/home/jtr4v/anaconda3/lib/python3.8', '/home/jtr4v/anaconda3/lib/python3.8/lib-dynload', '', '/home/jtr4v/anaconda3/lib/python3.8/site-packages', '/home/jtr4v/anaconda3/lib/python3.8/site-packages/IPython/extensions', '/home/jtr4v/.ipython']


In [2]:
import silence_tensorflow.auto  # Imported to suppress TensorFlow warnings and other non-essential log messages.

## Loading the graphs
We download the merged knowledge graph archive (`merged-kg.tar.gz`) and load it as an undirected graph.

In [3]:
import os
import urllib.request  # `import urllib` alone does not guarantee that the `request` submodule is loaded

os.makedirs("graph_no_cord_19", exist_ok=True)
# Download the archive only if it is not already present locally.
if not os.path.exists("graph_no_cord_19/merged-kg.tar.gz"):
    with urllib.request.urlopen("https://zenodo.org/record/4012578/files/merged-kg.tar.gz") as response, \
            open("graph_no_cord_19/merged-kg.tar.gz", "wb") as out_file:
        data = response.read()  # a `bytes` object
        out_file.write(data)

In [4]:
import os
os.system("tar -xvzf graph_no_cord_19/merged-kg.tar.gz -C graph_no_cord_19")

0

In [5]:
from ensmallen_graph import EnsmallenGraph

graph = EnsmallenGraph.from_csv(
    edge_path="graph_no_cord_19/merged-kg_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    default_edge_type="biolink:association",
    node_path="graph_no_cord_19/merged-kg_nodes.tsv",
    nodes_column="id",
    node_types_column="category",
    default_node_type="biolink:NamedThing",
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
    force_conversion_to_undirected=True
)

First, we print a short report showing all the available graph details, including the number of edges, nodes, and trap nodes, as well as both the connected components and the strongly connected components.

In [None]:
graph.report()

The following checks are not necessary, but are offered as sanity checks:

### Considered parameters
We are going to use the following parameters:

- **Walk lengths:** $100$ nodes.
- **Batch size:** $2^{7} = 128$ walks per batch.
- **Walk iterations:** $20$ iterations on the graph.
- **Window size:** $4$ nodes, meaning $4$ on the left and $4$ on the right of each center node. Note that the first *window_size* values on the left and on the right of each walk will be trimmed.
- **Return weight, inverse of $p$:** $1.0$.
- **Explore weight, inverse of $q$:** $1.0$.
- **Embedding size:** $100$.
- **Negative samples:** For the purpose of the [NCE function negative samples](https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss), we are going to use $30$. This is the number of negative classes to randomly sample per batch. This single sample of negative classes is evaluated for each element in the batch.
- **Optimizer:** [Nadam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Nadam).
- **Early stopping parameters:** We are going to use an Early Stopping criterion on the *validation loss*, with patience $5$ and delta $0.0001$.
- **Epochs:** The model will be trained up to $1000$ epochs.
- **Learning rate:** since the loss function is typically quite convex for the embedding problem, we can use a relatively high learning rate. We are going to use $0.1$ for this example to converge faster: this might lead to skipping some better minima that might be identified with a lower learning rate, such as the default one, which is $0.0001$.
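
The walk and window parameters above imply some simple arithmetic, sketched below. It assumes that trimming removes exactly *window_size* positions at each end of every walk, which is one reading of the note on the window size. The number of training examples embiggen actually produces per batch may differ.

```python
walk_length = 100
window_size = 4
batch_size = 2**7

# Context nodes seen by each center node: window_size on each side.
context_size = 2 * window_size
# Center nodes left per walk once both ends are trimmed (assumption).
centers_per_walk = walk_length - 2 * window_size
# Center/context examples in one batch of walks under the same assumption.
examples_per_batch = batch_size * centers_per_walk

print(context_size, centers_per_walk, examples_per_batch)
```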

#### Setting up the parameters

In [16]:
walk_length=100
batch_size=2**7
iterations=20
window_size=4
p=1.0
q=1.0
embedding_size=100
negatives_samples=30
patience=5
delta=0.0001
epochs=1000
learning_rate=0.1

#### Creating the training Keras sequence

In [20]:
from embiggen import Node2VecSequence

graph_sequence = Node2VecSequence(
    graph,
    walk_length=walk_length,
    batch_size=batch_size,
    iterations=iterations,
    window_size=window_size,
    return_weight=1/p,
    explore_weight=1/q
)
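
The `return_weight` and `explore_weight` arguments are passed as $1/p$ and $1/q$. The sketch below shows how such weights bias the next step of a second-order walk in the standard node2vec formulation. The function `node2vec_step_probabilities` and the toy graph are illustrative assumptions; the actual sampling inside `Node2VecSequence` may be implemented differently.

```python
from typing import Dict, Set


def node2vec_step_probabilities(
    adjacency: Dict[int, Set[int]],
    previous: int,
    current: int,
    return_weight: float,
    explore_weight: float,
) -> Dict[int, float]:
    """Return the probability of moving from `current` to each neighbour."""
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            # Going back to the node we just came from.
            weights[candidate] = return_weight
        elif candidate in adjacency[previous]:
            # Staying at distance one from the previous node.
            weights[candidate] = 1.0
        else:
            # Moving further away from the previous node.
            weights[candidate] = explore_weight
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Hypothetical undirected toy graph.
toy_adjacency = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

# With p = q = 1, as in this notebook, every neighbour is equally likely.
uniform = node2vec_step_probabilities(toy_adjacency, 0, 1, 1 / 1.0, 1 / 1.0)
# With p = 4 and q = 0.5, the walk avoids returning and prefers exploring.
biased = node2vec_step_probabilities(toy_adjacency, 0, 1, 1 / 4.0, 1 / 0.5)
print(uniform)
print(biased)
```

With both weights equal to $1.0$ the walk reduces to a uniform first-order random walk.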

## Creating the SkipGram model
We are going to set up the model to use multiple GPUs, if available.

In [21]:
from tensorflow.distribute import MirroredStrategy
from tensorflow.keras.optimizers import Nadam
from embiggen import SkipGram

strategy = MirroredStrategy()
with strategy.scope():
    model = SkipGram(
        vocabulary_size=graph.get_nodes_number(),
        embedding_size=embedding_size,
        window_size=window_size,
        negatives_samples=negatives_samples,
        optimizer=Nadam(learning_rate=learning_rate)
    )

model.summary()

Model: "SkipGram"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
words_embedding (InputLayer)    [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 1, 100)       20519300    words_embedding[0][0]            
__________________________________________________________________________________________________
flatten_2 (Flatten)             (None, 100)          0           embedding_2[0][0]                
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 8)]          0                                            
___________________________________________________________________________________________
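
To give an intuition for what the `negatives_samples` argument controls, here is a minimal pure-Python sketch of the simplified negative-sampling objective from word2vec. This is a swapped-in, simpler technique: it is neither TensorFlow's NCE loss nor embiggen's implementation, and all names and vectors below are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(
    center: Sequence[float],
    context: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """Loss for one (center, context) pair and a set of sampled negatives."""
    # Reward a high score for the true context node...
    positive_term = log_sigmoid(dot(center, context))
    # ...and a low score for each randomly sampled negative node.
    negative_term = sum(log_sigmoid(-dot(center, neg)) for neg in negatives)
    return -(positive_term + negative_term)


# Hypothetical 3-dimensional embeddings.
center = [0.5, -0.2, 0.1]
true_context = [0.4, -0.1, 0.3]
sampled_negatives = [[-0.3, 0.2, 0.0], [0.1, 0.1, -0.4]]
loss = negative_sampling_loss(center, true_context, sampled_negatives)
print(loss)
```

Sampling a few negatives avoids computing a softmax over every node in the vocabulary, which matters when the embedding layer covers hundreds of thousands of nodes.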

## Training the SkipGram model

In [22]:
from tensorflow.keras.callbacks import EarlyStopping

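history = model.fit(
    graph_sequence,
    steps_per_epoch=graph_sequence.steps_per_epoch,
    epochs=epochs,
    callbacks=[
        EarlyStopping(
            # No validation data is passed to fit, so "val_loss" is never
            # computed; the training loss is monitored instead. Switch back
            # to "val_loss" once a validation sequence is supplied.
            "loss",
            min_delta=delta,
            patience=patience,
            restore_best_weights=True
        )
    ]
)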

Epoch 1/1000
 347/1536 [=====>........................] - ETA: 3:58 - loss: 110.8636

KeyboardInterrupt: 
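
The interaction between `patience` and `min_delta` can be sketched in plain Python as follows. This is a simplified reading of the early stopping rule for a loss that should decrease. It ignores `restore_best_weights`, and the loss values are made up for illustration.

```python
from typing import Sequence


def early_stopping_epoch(losses: Sequence[float], patience: int, min_delta: float) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # The patience was never exhausted: training runs for all epochs.
    return len(losses)


# Hypothetical loss curve that plateaus after the third epoch.
toy_losses = [1.0, 0.5, 0.4, 0.40005, 0.39995, 0.4, 0.4, 0.4, 0.3]
stop_epoch = early_stopping_epoch(toy_losses, patience=5, min_delta=0.0001)
print(stop_epoch)
```

In this toy curve the drop from $0.4$ to $0.39995$ is smaller than the delta, so it does not count as an improvement and does not reset the patience counter.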

### Saving the model weights
We save the obtained model weights:

In [None]:
model.save_weights(f"{model.name}_no_cord19_weights.h5")

### Visualizing the training history
We can visualize the performance of the model during the training process as follows:

In [None]:
from plot_keras_history import plot_history

plot_history(history)

There may be some hiccups in the plot of the history if the model is reloaded from stored weights: [this is a known Keras issue](https://github.com/keras-team/keras/issues/4875) and is not related to either the holdouts used or the model.

## Saving the obtained embeddings
Finally, we save our hard-earned model embeddings. In another notebook we will show how to do link prediction on the obtained embedding.

In [None]:
import numpy as np

np.save(f"{model.name}_no_cord19_embedding.npy", model.embedding)
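
As a small sketch of how a saved embedding can be read back, the snippet below round-trips a toy matrix through `np.save` and `np.load` and compares two rows with cosine similarity. The random matrix stands in for `model.embedding`, and the file name and the `cosine_similarity` helper are hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for model.embedding: 5 nodes with 100-dimensional vectors.
rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(5, 100))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embedding.npy")
    np.save(path, toy_embedding)
    # Each row of the loaded matrix is the vector of one node.
    loaded = np.load(path)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


similarity = cosine_similarity(loaded[0], loaded[1])
print(loaded.shape, similarity)
```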