# Example Notebook to Create Embeddings

In this notebook we will create embeddings for the `FB15k` data.

We will then use these embeddings to initialize a new model for **out-of-vocabulary** data. This new model will freeze all of the entity and relation embeddings that are present in the training data. 

In [1]:
import config
import models
import tensorflow as tf
import numpy as np
import os


In [2]:
# Folder which contains, minimally, 3 files:
# 1. entity2id.txt
# 2. relation2id.txt
# 3. train2id.txt
data_path = "./benchmarks/FB15K/"

# Where we will save the model file and embeddings, respectively
train_file_path = "./res/model.vec.tf"
train_embedding_path = "./res/embedding.vec.json"
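
Before configuring the model, it can help to confirm that the data folder really contains the three files listed above. The following is a minimal sketch; the helper name `missing_data_files` is hypothetical and not part of the library used in this notebook.

```python
import os

# The three file names come from the comment in the cell above.
REQUIRED_FILES = ("entity2id.txt", "relation2id.txt", "train2id.txt")

def missing_data_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are absent from `folder`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]

missing = missing_data_files("./benchmarks/FB15K/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found")
```

The same check can be run against any other data folder by passing its path.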


# Run ComplEx to Create Initial Embeddings

In [3]:
"""
Method:

Create ComplEx embeddings for data in `data_path`
"""

# See http://proceedings.mlr.press/v48/trouillon16.pdf for some justification
# of Adam, the entity and relation negative rates, and alpha
con = config.Config()
con.set_in_path(data_path)
con.set_work_threads(8) # CPU cores
con.set_train_times(5) # number of epochs
con.set_nbatches(200) # batches per epoch; setting observations per batch instead would be easier to interpret
con.set_dimension(100) # embedding dimension (real + imaginary parts)
con.set_ent_neg_rate(10) # negative samples per triple made by corrupting an entity
con.set_rel_neg_rate(5) # negative samples per triple made by corrupting the relation
con.set_lmbda(0) # L2 regularization penalty

con.set_alpha(0.001) # learning rate
con.set_opt_method("Adam")

# The model is exported via tf.train.Saver() automatically.
con.set_export_files(train_file_path, 10) # second argument: train steps between model exports
# Model parameters are exported to a JSON file automatically.
con.set_out_files(train_embedding_path)

con.init()
#Set the knowledge embedding model
con.set_model(models.ComplEx)





In [4]:
#Train the model.
con.run()

Epoch: 0, loss: 138.63, time: 42.0, mag: 0.011, sd: 0.012, [-0.072, 0.057]
Epoch: 1, loss: 108.16, time: 41.0, mag: 0.098, sd: 0.113, [-0.442, 0.44]
Epoch: 2, loss: 41.91, time: 41.0, mag: 0.239, sd: 0.259, [-0.795, 0.801]
Epoch: 3, loss: 25.81, time: 40.0, mag: 0.293, sd: 0.319, [-1.095, 1.079]
Epoch: 4, loss: 19.19, time: 39.0, mag: 0.325, sd: 0.355, [-1.23, 1.2]
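
The comment on `set_nbatches` notes that observations per batch would be easier to interpret than batches per epoch. Assuming each epoch splits the training triples evenly across the batches (an assumption about the library, not something this notebook confirms), the two quantities convert as in the sketch below. The helper names and the value of `n_train` are hypothetical.

```python
import math

def batch_size_for(n_train, nbatches):
    """Observations per batch when n_train triples are split into nbatches batches."""
    return n_train // nbatches

def nbatches_for(n_train, batch_size):
    """Number of batches needed so that no batch exceeds batch_size observations."""
    return math.ceil(n_train / batch_size)

n_train = 100_000  # hypothetical number of training triples
print(batch_size_for(n_train, 200))  # 500
print(nbatches_for(n_train, 512))    # 196
```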


# Run ComplEx_freeze Using the Embeddings Produced Above

I have produced a minimal graph dataset of new **out-of-vocabulary** entities and relations. This data can be found in the folder `FB15K_OOV`.

Let's create embeddings for this data by freezing the embeddings produced above and training the new graph into the same embedding space.
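
To illustrate what freezing means here, the sketch below masks gradient updates so that only the rows for new ids change. It is a toy NumPy example under stated assumptions (new ids are appended after the pretrained rows, and a random placeholder stands in for the real gradient); it is not how `ComplEx_freeze` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

n_old, n_new, dim = 4, 2, 3                          # toy sizes, not the FB15k ones
pretrained = rng.normal(size=(n_old, dim))           # stands in for the saved embeddings
new_rows = rng.normal(scale=0.1, size=(n_new, dim))  # fresh rows for out-of-vocabulary ids

# Full table: pretrained rows first, new rows appended after them.
table = np.vstack([pretrained, new_rows])

# Mask that is 0 for frozen rows and 1 for trainable rows.
trainable = np.zeros((n_old + n_new, 1))
trainable[n_old:] = 1.0

def sgd_step(table, grad, lr=0.1):
    """Apply one gradient step, but only to the trainable (new) rows."""
    return table - lr * trainable * grad

grad = rng.normal(size=table.shape)                  # placeholder gradient
updated = sgd_step(table, grad)

print(np.array_equal(updated[:n_old], pretrained))   # True: frozen rows are untouched
print(np.array_equal(updated[n_old:], new_rows))     # False: new rows have moved
```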

In [5]:
new_data_path = "./benchmarks/FB15K_OOV/"

# Where we will save the model file and embeddings, respectively
new_file_path = "./res/new/model.vec.tf"
new_embedding_path = "./res/new/embedding.vec.json"

In [6]:
con = config.Config()
con.set_in_path(new_data_path)
con.set_work_threads(8) # CPU cores
con.set_train_times(10) # number of epochs; convergence mostly happens by around 10
con.set_nbatches(1) # batches per epoch; setting observations per batch instead would be easier to interpret
con.set_dimension(100) # embedding dimension (real + imaginary parts)
con.set_ent_neg_rate(10) # negative samples per triple made by corrupting an entity
con.set_rel_neg_rate(5) # negative samples per triple made by corrupting the relation
con.set_lmbda(0) # L2 regularization penalty

con.set_alpha(0.001) # learning rate
con.set_opt_method("Adam")


# Initialize the embeddings with the embeddings produced above.
# The initializer must be a .json file with the keys "ent_embeddings" and "rel_embeddings".
con.set_freeze_train_embeddings(True)
con.set_embedding_initializer_path(train_embedding_path)

con.set_export_files(new_file_path, 10) # second argument: train steps between model exports
con.set_out_files(new_embedding_path)

con.init()
#Set the knowledge embedding model
con.set_model(models.ComplEx_freeze)



New entities found:
-- Total Entities in embedding file: 14951
-- Total Entities in data: 14970 




New relationships found:
-- Total Relationships in embedding file: 1345
-- Total Relationships in data: 1354 






In [7]:
#Train the model.
con.run()

Epoch: 0, loss: 61.34, time: 0.0, mag: 0.32, sd: 0.401, [-1.176, 1.302]
Epoch: 1, loss: 75.78, time: 0.0, mag: 0.322, sd: 0.396, [-1.175, 1.303]
Epoch: 2, loss: 61.1, time: 0.0, mag: 0.324, sd: 0.403, [-1.175, 1.304]
Epoch: 3, loss: 67.29, time: 0.0, mag: 0.32, sd: 0.399, [-1.174, 1.305]
Epoch: 4, loss: 59.33, time: 0.0, mag: 0.324, sd: 0.406, [-1.174, 1.306]
Epoch: 5, loss: 57.98, time: 0.0, mag: 0.32, sd: 0.403, [-1.174, 1.307]
Epoch: 6, loss: 52.37, time: 0.0, mag: 0.32, sd: 0.401, [-1.173, 1.308]
Epoch: 7, loss: 45.19, time: 0.0, mag: 0.327, sd: 0.407, [-1.173, 1.309]
Epoch: 8, loss: 58.51, time: 0.0, mag: 0.317, sd: 0.396, [-1.173, 1.31]
Epoch: 9, loss: 55.78, time: 0.0, mag: 0.32, sd: 0.404, [-1.172, 1.311]


Embeddings for the out-of-vocabulary entities and relations have now been created, and a file containing both the new and the old embeddings has been saved at the specified path.

# Compare new and old embeddings

Here we compare the embeddings used to initialize the new training run with the embeddings produced by the initialized model.

Our goal is to confirm that embeddings in the training data were successfully frozen, and do not change after retraining. 

We expect:

- Any entity or relationship present in the training data should have the same embedding as before
- Any entity or relationship present in the training data that also appears in the new (out-of-vocabulary) data should still have the same embedding
- Any entity or relationship absent from the training data should have an embedding only in the new file
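
The first two expectations can also be checked for every training id at once rather than by spot-checking a few. The sketch below assumes the embeddings are lists of lists, as `json` produces, and that an id is simply a row index. The helper names are hypothetical.

```python
def frozen_prefix_unchanged(old, new):
    """True if every row of `old` appears unchanged at the same index in `new`."""
    return len(new) >= len(old) and all(o == n for o, n in zip(old, new))

def first_changed_index(old, new):
    """Index of the first old row that differs in `new`, or None if all match."""
    for i, (o, n) in enumerate(zip(old, new)):
        if o != n:
            return i
    return None

# Toy data standing in for the real embedding lists
old = [[0.1, 0.2], [0.3, 0.4]]
new_ok = old + [[0.5, 0.6]]
new_bad = [[0.1, 0.2], [0.3, 0.9], [0.5, 0.6]]

print(frozen_prefix_unchanged(old, new_ok))  # True
print(first_changed_index(old, new_bad))     # 1
```

Once the embedding files are loaded, calling `frozen_prefix_unchanged(old_ent_embeddings, new_ent_embeddings)` would cover all training entities in one pass.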


In [8]:
import json
with open(train_embedding_path, "r") as f: 
    old_embeddings = json.loads(f.read())
    old_ent_embeddings = old_embeddings["ent_embeddings"]
    old_rel_embeddings = old_embeddings["rel_embeddings"]


In [9]:
with open(new_embedding_path, "r") as f: 
    new_embeddings = json.loads(f.read())
    new_ent_embeddings = new_embeddings["ent_embeddings"]
    new_rel_embeddings = new_embeddings["rel_embeddings"]
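
The two loading cells repeat the same steps, so they could be folded into one small helper that also checks for the two expected keys. This is a sketch; `load_embeddings` is a hypothetical name, not a function provided by the library.

```python
import json

def load_embeddings(path):
    """Load an embedding JSON file and return (entity_rows, relation_rows)."""
    with open(path, "r") as f:
        data = json.load(f)
    # Fail early with a clear message if the file lacks an expected key.
    for key in ("ent_embeddings", "rel_embeddings"):
        if key not in data:
            raise KeyError("{} is missing the key {!r}".format(path, key))
    return data["ent_embeddings"], data["rel_embeddings"]
```

For example, `old_ent_embeddings, old_rel_embeddings = load_embeddings(train_embedding_path)` would replace the first loading cell.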

### Compare Entities

In [11]:
# All entities up to and including id 14950 (the maximum entity id in train2id)
# should be unchanged, e.g. 1018, 1234, 4169
print(old_ent_embeddings[1233] == new_ent_embeddings[1233])
print(old_ent_embeddings[1234] == new_ent_embeddings[1234])
print(old_ent_embeddings[1235] == new_ent_embeddings[1235])

True
True
True


In [16]:
# New entities should have embeddings in the new file, but not in the old one

# 14950 is the last training entity id, so 14951 is the first new entity
entity_id = 14951

try:
    print(old_ent_embeddings[entity_id])
except IndexError:
    print("No embedding for entity {} in train embeddings".format(entity_id))
    
print("Embedding for entity {} in new embeddings:".format(entity_id))    
print(new_ent_embeddings[entity_id][0:10])    

No embedding for entity 14951 in train embeddings
Embedding for entity 14951 in new embeddings:
[0.2992086708545685, -0.10801243782043457, -0.5317354798316956, -0.6309638619422913, 0.3361939787864685, 0.7201786041259766, -0.011285864748060703, -0.8057375550270081, 0.22400900721549988, -0.2980792224407196]


### Compare Relationships

In [15]:
# All relationships up to and including id 1344 (the maximum relation id in train2id)
# should be unchanged, e.g. 38, 58, 135
print(old_rel_embeddings[57] == new_rel_embeddings[57])
print(old_rel_embeddings[58] == new_rel_embeddings[58])
print(old_rel_embeddings[59] == new_rel_embeddings[59])

True
True
True


In [17]:
# New relationships should have embeddings in the new file, but not in the old one

# 1344 is the last training relation id, so 1345 is the first new relationship
rel_id = 1345

try:
    print(old_rel_embeddings[rel_id])
except IndexError:
    print("No relationship embedding for {} in train embeddings".format(rel_id))
    
print("Embedding for relationship {} in new embeddings:".format(rel_id))    
print(new_rel_embeddings[rel_id][0:10])  

No relationship embedding for 1345 in train embeddings
Embedding for relationship 1345 in new embeddings:
[0.0993245393037796, -0.6665051579475403, -0.0976848378777504, 0.19483378529548645, -0.6981079578399658, -0.2871093153953552, 0.2919817864894867, -0.31209680438041687, -0.07304192334413528, 0.31981706619262695]


### Lengths should be as expected

In [18]:
print(len(old_ent_embeddings))
print(len(new_ent_embeddings))
print(len(old_rel_embeddings))
print(len(new_rel_embeddings))

14951
14970
1345
1354
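
These lengths agree with the totals printed when the frozen model was initialized. As a small worked check, using only the four numbers above and assuming new ids are appended directly after the training ids (which the index checks earlier suggest):

```python
# Lengths printed in the cell above
n_old_ent, n_new_ent = 14951, 14970
n_old_rel, n_new_rel = 1345, 1354

added_entities = n_new_ent - n_old_ent
added_relations = n_new_rel - n_old_rel
print(added_entities, added_relations)  # 19 9

# New ids start right after the last training id.
new_entity_ids = list(range(n_old_ent, n_new_ent))
new_relation_ids = list(range(n_old_rel, n_new_rel))
print(new_entity_ids[0], new_entity_ids[-1])      # 14951 14969
print(new_relation_ids[0], new_relation_ids[-1])  # 1345 1353
```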
