# Assignment — Multihop query answering

In this task, we will implement the [betaE](http://snap.stanford.edu/betae/) model. Unlike Query2Box, it can handle the negation operator.

First, we need to download the data.


In [None]:
import torch
from torch import nn
import torch.nn.functional as F
import requests
import pandas as pd
import numpy as np
from zlib import adler32
from tqdm.notebook import trange

In [None]:
url = 'https://raw.githubusercontent.com/netspractice/network-science/main/datasets/countries_edges.tsv'
open('countries_edges.tsv', 'wb').write(requests.get(url).content)
url = 'https://raw.githubusercontent.com/netspractice/network-science/main/datasets/countries_entities.tsv'
open('countries_entities.tsv', 'wb').write(requests.get(url).content)
url = 'https://raw.githubusercontent.com/netspractice/network-science/main/datasets/countries_relations.tsv'
open('countries_relations.tsv', 'wb').write(requests.get(url).content);


edges = pd.read_csv('countries_edges.tsv', sep='\t').values
entity_labels = pd.read_csv('countries_entities.tsv', sep='\t', index_col=0).label.values
relation_labels = pd.read_csv('countries_relations.tsv', sep='\t', index_col=0).label.values

edges_labeled = np.stack([entity_labels[edges[:, 0]], 
                          entity_labels[edges[:, 1]], 
                          relation_labels[edges[:, 2]]], axis=1)

df = pd.DataFrame(edges_labeled, columns=['h', 't', 'r'])[['h', 'r', 't']]
df = df[df.h != df.t].reset_index(drop=True)
df.head()

Now, let us encode the entity names with indices

In [None]:
ent2id = sorted(list(set(df.h) | set(df.t)))
ent2id = dict(zip(ent2id, range(len(ent2id))))

rel2id = sorted(list(set(df.r)))
rel2id = dict(zip(rel2id, range(len(rel2id))))

df.h = df.h.map(ent2id)
df.t = df.t.map(ent2id)
df.r = df.r.map(rel2id)

## Task 1. Mine queries (2 points)

To train and validate our model, we first need to mine queries. In this task, we will consider only two types of queries:

1. Intersection of two projections ($V_?: Relation_1(Anchor_1, V_?) \land Relation_2(Anchor_2, V_?)$). For example, consider the natural-language query "List the European countries that have held the World Cup". It can be converted to the logical statement $V_?: Located(Europe, V_?) \land Held(World Cup, V_?)$.
1. Intersection of a projection and the negation of a projection ($V_?: Relation_1(Anchor_1, V_?) \land \neg Relation_2(Anchor_2, V_?)$). For example, consider the natural-language query "List the European countries that have __never__ held the World Cup". It can be converted to the logical statement $V_?: Located(Europe, V_?) \land \neg Held(World Cup, V_?)$.


To find such queries, we will use the `mine_intersection_and_negation` method.

The first type of query was covered in the [seminar](https://github.com/netspractice/advanced_gnn/blob/main/lab_multihop/lab.ipynb) in the method `generate_queries_conjunction`. The general intuition is to find head-relation pairs that project to the same tails. One way to do this is to join the triplets with themselves on the tail column. The result contains two head-relation pairs and the tail they share.

The second type of query contains a negation, and we can mine it in a similar way. First, we find head-relation pairs that lead to the same tails. Second, we find all possible answers for the first head-relation pair of each joined row. Finally, we remove the shared tails (the intersection) from those answers.


The method `mine_intersection_and_negation` should work as follows:

The first three steps are similar to the `generate_queries_conjunction` method from the seminar.

1. Merge triplet dataframe on itself by column "t" (let us call it `df_merged`)
2. Remove lines from `df_merged` where heads and tails (from left and right parts after join) are similar
3. Group the `df_merged` on heads and relations from both parts of the dataset and aggregate tails as a list (let us call it `df_intersection`)
4. Add `is_negation` column to `df_intersection` with zero values.

Mine negation examples

5. Group the `df_merged` by left heads and relations, aggregate tails to set (let us call it `df_projection`)
6. Merge `df_projection` and `df_merged` on head and relation (let us call it `df_negation_pre`)
7. Group `df_negation_pre` by heads and relations from both previous datasets
8. Aggregate tails from `df_projection` with `"first"` (call this column `positive_tails`) and tails from `df_merged` with `set` (call this column `negative_tails`, and the resulting dataframe `df_negation`)
9. For each row in `df_negation`, remove the `negative_tails` values from `positive_tails` (name the result `t`)
10. Remove all lines from `df_negation` where the set of tails (`t`) is empty
11. Add column `is_negation` with ones.


12. Filter columns in `df_negation` and `df_intersection` to `["h_x", "r_x", "h_y", "r_y", "t", "is_negation"]`
13. Append `df_negation` to `df_intersection`
14. Return numpy values of it
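
The self-join at the heart of these steps can be sketched on a toy table. This is only an illustration, not the full `mine_intersection_and_negation`: the `toy` dataframe is made up, and the filter that drops a projection paired with itself is an assumption about what step 2 means.

```python
import pandas as pd

# Hypothetical toy triplets (already encoded as indices)
toy = pd.DataFrame({"h": [0, 1, 0, 1], "r": [0, 0, 0, 1], "t": [2, 2, 3, 4]})

# Self-join on the tail: each row pairs two (head, relation)
# projections that reach the same tail
merged = toy.merge(toy, on="t")
# Assumption: drop rows that pair a projection with itself
merged = merged[(merged.h_x != merged.h_y) | (merged.r_x != merged.r_y)]

# Intersection queries: shared tails per pair of projections
intersection = (
    merged.groupby(["h_x", "r_x", "h_y", "r_y"]).t.agg(list).reset_index()
)

# Negation for the pair (h=0, r=0) AND NOT (h=1, r=0):
# all answers of the left projection minus the shared tails
projection = toy.groupby(["h", "r"]).t.agg(set)
left_answers = projection.loc[(0, 0)]
pair_mask = (
    (merged.h_x == 0) & (merged.r_x == 0) & (merged.h_y == 1) & (merged.r_y == 0)
)
shared = set(merged[pair_mask].t)
negation_answers = left_answers - shared
```

Here the left projection reaches tails 2 and 3, the two projections share tail 2, so the negation query keeps only tail 3.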

Unencoded example (the real output contains entity and relation indices instead of names, so it will look different):

```
[
    [Head_1, Relation_1, Head_2, Relation_2, Tails, is_negation], # columns
    [Russia, separated from, Yugoslavia, diplomatic relation, [Soviet Union], 0],
    [Russia, diplomatic relation, Germany, diplomatic relation, [Germany, Republic of Abkhazia, South Ossetia, State of Palestine], 1],
    ...
]

```

In [None]:
def mine_intersection_and_negation(df):
    # YOUR CODE HERE
    raise NotImplementedError()

Now, we can mine queries and split our data into train, validation and test parts.

In [None]:
queries = mine_intersection_and_negation(df)

np.random.seed(0)

perm = np.random.permutation(queries.shape[0])
train_queries = queries[perm[: int(perm.shape[0] * 0.98)]]
val_queries = queries[perm[int(perm.shape[0] * 0.98): int(perm.shape[0] * 0.99)]]
test_queries = queries[perm[int(perm.shape[0] * 0.99):]]

In [None]:
assert queries.shape == (482642, 6)
assert (
    (queries[:, 0] == ent2id['German Empire']) &
    (queries[:, 1] == rel2id['shares border with']) &
    (queries[:, 2] == ent2id['County of Astarac']) &
    (queries[:, 3] == rel2id['country']) &
    (queries[:, 5] == 1)
).sum() == 1

## Task 2. Negation Dataset (2 points)

Let us create a torch `Dataset` that iterates over our query array and samples negatives.

The first method, `sample_negative`, should sample a random entity id that is not in the positive array (`tails` in the example above).

The second one, `__getitem__`, should work as follows:

1. Get a row from the query array
2. Sample a negative example, i.e. an entity that is not among the positive ones
3. Select a random positive example
4. Convert the `is_negation` flag to `1 - is_negation`
5. Return the tuple (head_1, relation_1, head_2, relation_2, converted_is_negation, positive sample, negative sample)
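
One possible way to draw such a negative is rejection sampling. The sketch below is a standalone helper, not the required `sample_negative` method; the name `sample_outside` and its arguments are hypothetical.

```python
import random


def sample_outside(positives, n_ent, rng=random):
    """Draw entity ids uniformly until one is not a known answer."""
    positives = set(positives)
    if len(positives) >= n_ent:
        raise ValueError("every entity is a positive, nothing to sample")
    while True:
        candidate = rng.randrange(n_ent)
        if candidate not in positives:
            return candidate


rng = random.Random(0)
negative = sample_outside([1, 3, 5], n_ent=10, rng=rng)
```

Rejection sampling is cheap when the positives are a small fraction of all entities, which is assumed here; for queries with very many answers, sampling from the explicit complement may be faster.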


In [None]:
import random

random.seed(0)

class NegationDataset(torch.utils.data.Dataset):
    def __init__(self, queries, n_ent):
        self.queries = queries
        self.n_ent = n_ent
    
    def sample_negative(self, positives):
        # YOUR CODE HERE
        raise NotImplementedError()
    
    def __len__(self):
        return len(self.queries)
    
    def __getitem__(self, idx):
        # YOUR CODE HERE
        raise NotImplementedError()

Load queries to `NegationDataset` and torch DataLoaders

In [None]:
train_ds = NegationDataset(train_queries, len(ent2id))
val_ds = NegationDataset(val_queries, len(ent2id))
test_ds = NegationDataset(test_queries, len(ent2id))

train_loader = torch.utils.data.DataLoader(train_ds, batch_size=512, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_ds, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_ds, batch_size=32, shuffle=True)

In [None]:
anchor1, relation1, anchor2, relation2, neg_pow, positive, negative = train_ds[0]

assert (anchor1, relation1, anchor2, relation2, neg_pow) == tuple(train_queries[0][:4].tolist() + [1 - train_queries[0][-1]])
assert positive in train_queries[0, 4]
assert negative not in train_queries[0, 4]

## Task 3. Model (6 points: 1.5 points per method and 1.5 for metrics)

![](http://snap.stanford.edu/betae/model.png)

The general idea of the BetaE method is to model queries with the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution). Consider the image above: the higher the Beta density at a point, the brighter that region of the space. We can think of it as a continuous analogue of the box from the Query2Box model. There, a hyperrectangular region encodes the query and covers the area where the possible answer tails lie. Similarly, the distribution continuously highlights the regions where answers are most probable.

The Beta distribution has two parameters: the `alpha` and `beta` vectors. So, first, we will encode every query with two embedding vectors, one for each parameter. The `alpha` and `beta` parameters must always be positive, so we need to clamp the vectors to a positive interval. This is done with the `Regularizer` helper class.
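
As a quick illustration of the clamping, the snippet below applies the same shift-and-clamp operation to a hand-made vector. The values `1`, `0.05` and `1e9` are the ones passed to `Regularizer` in the model below; the input tensor is made up.

```python
import torch

raw = torch.tensor([-2.0, -0.5, 0.0, 3.0])  # unconstrained embedding values

# Shift by base_add=1, then clamp into [0.05, 1e9]
positive = torch.clamp(raw + 1.0, 0.05, 1e9)
```

Every entry of `positive` is at least `0.05`, so it is a valid Beta parameter.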

To handle intersection queries (`BetaIntersection` class), we will calculate center vectors similarly to the previous models from the seminar (GQE and Query2Box). We will take the attention-weighted mean of the input vectors.
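
The attention-weighted mean can be sketched without any learned layers. In the snippet below, random `scores` stand in for the output of the attention network (an assumption for illustration only); the softmax runs over the conjunct dimension, so the weights of the two projections sum to one for every coordinate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_conj, batch_size, dim = 2, 3, 4

# Alpha parameters of the two projections being intersected
alphas = torch.rand(num_conj, batch_size, dim) + 0.05
# Stand-in for the attention network output (hypothetical values)
scores = torch.randn(num_conj, batch_size, dim)

attention = F.softmax(scores, dim=0)          # weights over conjuncts
alpha = torch.sum(attention * alphas, dim=0)  # (batch_size, dim)
```

Because the weights form a convex combination, each coordinate of `alpha` lies between the corresponding coordinates of the two inputs, so positivity is preserved.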

However, the projection procedure is different (`BetaProjection` class). It will be done through the MergeLayer, i.e. we concatenate the entity and relation vectors and pass them through the fully-connected network.

We will copy helper classes from the original paper [implementation](https://github.com/snap-stanford/KGReasoning).

In [None]:
class Regularizer:
    def __init__(self, base_add, min_val, max_val):
        self.base_add = base_add
        self.min_val = min_val
        self.max_val = max_val

    def __call__(self, entity_embedding):
        return torch.clamp(entity_embedding + self.base_add, self.min_val, self.max_val)


class BetaIntersection(nn.Module):
    def __init__(self, dim):
        super(BetaIntersection, self).__init__()
        self.dim = dim
        self.layer1 = nn.Linear(2 * self.dim, 2 * self.dim)
        self.layer2 = nn.Linear(2 * self.dim, self.dim)

        nn.init.xavier_uniform_(self.layer1.weight)
        nn.init.xavier_uniform_(self.layer2.weight)

    def forward(self, alpha_embeddings, beta_embeddings):
        all_embeddings = torch.cat([alpha_embeddings, beta_embeddings], dim=-1)
        layer1_act = F.relu(self.layer1(all_embeddings))  # (num_conj, batch_size, 2 * dim)
        attention = F.softmax(self.layer2(layer1_act), dim=0)  # (num_conj, batch_size, dim)

        alpha_embedding = torch.sum(attention * alpha_embeddings, dim=0)
        beta_embedding = torch.sum(attention * beta_embeddings, dim=0)

        return alpha_embedding, beta_embedding

class BetaProjection(nn.Module):
    def __init__(self, entity_dim, relation_dim, hidden_dim, projection_regularizer, num_layers):
        super(BetaProjection, self).__init__()
        self.entity_dim = entity_dim
        self.relation_dim = relation_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.layer1 = nn.Linear(self.entity_dim + self.relation_dim, self.hidden_dim)  # 1st layer
        self.layer0 = nn.Linear(self.hidden_dim, self.entity_dim)  # final layer
        for nl in range(2, num_layers + 1):
            setattr(self, "layer{}".format(nl), nn.Linear(self.hidden_dim, self.hidden_dim))
        for nl in range(num_layers + 1):
            nn.init.xavier_uniform_(getattr(self, "layer{}".format(nl)).weight)
        self.projection_regularizer = projection_regularizer

    def forward(self, e_embedding, r_embedding):
        x = torch.cat([e_embedding, r_embedding], dim=-1)
        for nl in range(1, self.num_layers + 1):
            x = F.relu(getattr(self, "layer{}".format(nl))(x))
        x = self.layer0(x)
        x = self.projection_regularizer(x)
        return x

The final task is to define the simplified version of the BetaE model. The main difference from the original implementation is that it can handle only two types of queries described above.

You will need to define three methods: `encode_projection`, `encode_query`, `calc_logit`.

The `encode_projection` function projects an entity with a given relation.

It takes two arguments: a tensor with anchor ids (entities) and a tensor with relation ids. It works as follows:

1. Select the `self.entity_embedding` subset according to `anchor`
2. Select the `self.relation_embedding` subset according to `relation`
3. Regularize the entity embeddings from point 1 with `self.entity_regularizer`
4. Calculate the projection embeddings with `self.projection_net`
5. Split the embedding from point 4 into two parts with `torch.chunk` (`alpha` and `beta` embeddings)
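
The last step relies on the layout of the embedding: the first half of the last dimension holds `alpha` and the second half holds `beta`. A minimal sketch with a made-up tensor in place of the projection output:

```python
import torch

batch_size, dim = 4, 3
# Stand-in for the projection output: alpha and beta concatenated
projected = torch.rand(batch_size, 2 * dim) + 0.05

alpha, beta = torch.chunk(projected, 2, dim=-1)  # two (batch_size, dim) tensors
```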

The `calc_logit` function takes entity ids and the query Beta distributions (`q`). It works as follows:

1. Select the `self.entity_embedding` subset according to `entities`
2. Regularize the entity embeddings from point 1 with `self.entity_regularizer`
3. Split the embedding from point 2 into two parts with `torch.chunk` (alpha and beta embeddings)
4. Build Beta distributions (`torch.distributions.beta.Beta`) from them (let us call the result `p`)
5. Calculate the logit using the following formula:

$$\text{logit} = \gamma - \Vert D_{KL}(p, q)\Vert_1,$$

where `p` and `q` are the Beta distributions. `p` is a distribution derived here in point 4, and `q` is the input parameter.
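
The formula can be checked on small hand-made distributions. This is a sketch under the assumption that the KL divergence is computed per dimension with `torch.distributions.kl_divergence` and then summed with an L1 norm over the last dimension; the parameter values are made up.

```python
import torch
from torch.distributions import Beta, kl_divergence

gamma = 24.0

# p plays the role of the entity distribution, q of the query distribution
p = Beta(torch.tensor([[1.0, 2.0, 0.5]]), torch.tensor([[1.0, 1.0, 3.0]]))
q = Beta(torch.tensor([[1.5, 2.0, 0.5]]), torch.tensor([[1.0, 0.7, 3.0]]))

kl = kl_divergence(p, q)                     # per-dimension KL, shape (1, 3)
logit = gamma - torch.norm(kl, p=1, dim=-1)  # shape (1,)

# Identical distributions have zero divergence, so the logit equals gamma
same = gamma - torch.norm(kl_divergence(p, p), p=1, dim=-1)
```

The closer an entity distribution is to the query distribution, the closer its logit gets to `gamma`, which is the upper bound.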

The negation operator is encoded in the `encode_query` method. The general idea is to take the reciprocal of the parameters of the initial Beta distribution, which we can do by raising the input `alpha` and `beta` embeddings to the power `-1`. This turns high-density regions into low-density ones and vice versa.

![](http://snap.stanford.edu/betae/betae.png)
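
The effect of the reciprocal can be seen on a single one-dimensional distribution. The parameters below are chosen only for illustration: `Beta(4, 4)` is bell-shaped around 0.5, while its reciprocal `Beta(0.25, 0.25)` puts most of its mass near the edges.

```python
import torch
from torch.distributions import Beta

alpha, beta = torch.tensor(4.0), torch.tensor(4.0)
original = Beta(alpha, beta)
negated = Beta(alpha ** -1, beta ** -1)  # reciprocal parameters

center, edge = torch.tensor(0.5), torch.tensor(0.05)
original_center = original.log_prob(center).exp().item()
original_edge = original.log_prob(edge).exp().item()
negated_center = negated.log_prob(center).exp().item()
negated_edge = negated.log_prob(edge).exp().item()
```

The original density is higher at the center than at the edge, and the negated density reverses this ordering.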

The `encode_query` method takes the anchors and relations of the two parts of the query (anchor1, relation1, anchor2, relation2) and `neg_pow`, the converted negation flag (1 - negation flag). It works as follows:

1. Calculate the projection alpha and beta embeddings for anchor1 and relation1
2. Calculate the projection alpha and beta embeddings for anchor2 and relation2
3. Raise the alpha and beta embeddings from point 2 to the power of `neg_pow`
4. Calculate the alpha and beta embeddings of the whole query via `self.center_net` (stack the alphas and the betas, then pass them to the module)
5. Build Beta distributions (`torch.distributions.beta.Beta`) from the result

In [None]:
class BetaE(nn.Module):
    def __init__(self, gamma, entity_dim, nentity, nrelation, num_layers):
        super(BetaE, self).__init__()
        self.entity_dim = entity_dim
        self.relation_dim = entity_dim
        self.epsilon = 2.0
        self.entity_regularizer = Regularizer(1, 0.05, 1e9)
        self.projection_regularizer = Regularizer(1, 0.05, 1e9)
        self.gamma = nn.Parameter(
            torch.Tensor([gamma]),
            requires_grad=False
        )
        self.entity_embedding = nn.Parameter(torch.zeros(nentity, self.entity_dim * 2))  # alpha and beta
        self.relation_embedding = nn.Parameter(torch.zeros(nrelation, self.relation_dim))
        self.__init_embeddings__()
        self.center_net = BetaIntersection(self.entity_dim)
        self.projection_net = BetaProjection(self.entity_dim * 2,
                                             self.relation_dim,
                                             self.entity_dim,
                                             self.projection_regularizer,
                                             num_layers)
    
    def __init_embeddings__(self):
        emb_range = (self.gamma.item() + self.epsilon) / self.entity_dim
        nn.init.uniform_(
            self.entity_embedding,
            a=-emb_range,
            b=emb_range
        )

        nn.init.uniform_(
            self.relation_embedding,
            a=-emb_range,
            b=emb_range
        )
    
    def forward(self, anchor1, relation1, anchor2, relation2, neg_pow, positive, negative):
        dists = self.encode_query(anchor1, relation1, anchor2, relation2, neg_pow)
        return self.calc_logit(positive.reshape(-1, 1), dists), self.calc_logit(negative.reshape(-1, 1), dists)

    def encode_projection(self, anchor, relation):
        ### BEGIN SOLUTION
        embedding = self.entity_regularizer(self.entity_embedding[anchor])
        r_embedding = self.relation_embedding[relation]
        embedding = self.projection_net(embedding, r_embedding)
        return torch.chunk(embedding, 2, dim=-1)
        ### END SOLUTION

    def calc_logit(self, entities, dists):
        # YOUR CODE HERE
        raise NotImplementedError()

    def encode_query(self, anchor1, relation1, anchor2, relation2, neg_pow):
        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
model = BetaE(24, 800, len(ent2id), len(rel2id), 2)

In [None]:
a, b = model.encode_projection(torch.LongTensor([0]), torch.LongTensor([0]))

assert a.shape == torch.Size([1, 800])
assert b.shape == torch.Size([1, 800])
assert a.min().item() >= 0.05
assert b.min().item() >= 0.05
assert a.max().item() <= 1e9
assert b.max().item() <= 1e9

In [None]:
beta = torch.distributions.beta.Beta(torch.ones((1, 800)), torch.ones((1, 800)))
l = round(model.calc_logit(torch.LongTensor([0]), beta).item(), 4)
assert l > 23
assert l < 24

In [None]:
a = model.encode_query(torch.LongTensor([0]), torch.LongTensor([0]), torch.LongTensor([0]), torch.LongTensor([0]), torch.LongTensor([-1]))

assert a.batch_shape == torch.Size([1, 1, 800])

x = (a.concentration0 > a.concentration1).float().mean().item()
assert (x > 0.45) and (x < 0.55)

Let us define the training and evaluation loops. Our goal is to minimize the distance between the query embedding and the positive answer embedding, and to maximize it between the query and the negative answer. We can use the negative-sampling loss from the word2vec paper:

$$L = - \log{\sigma (\gamma - \Vert D_{KL}(p_{pos}, q)\Vert_1)} - \log{\sigma (\Vert D_{KL}(p_{neg}, q)\Vert_1 - \gamma)},$$

where $p_{pos}$ and $p_{neg}$ are the distributions of the positive and negative answers, and $q$ is the query distribution.

To evaluate our model, we will use Accuracy@5: for each query, we take the five entities with the highest logits and check whether the positive example is among them.
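
Both quantities can be illustrated on hand-made logits. The numbers below are made up; in the real loops the logits come from `calc_logit`.

```python
import torch
import torch.nn.functional as F

# Loss: logits of the positive and negative answers for two queries
pos_logit = torch.tensor([3.0, -1.0])
neg_logit = torch.tensor([-2.0, 0.5])
loss = -F.logsigmoid(pos_logit).mean() - F.logsigmoid(-neg_logit).mean()

# Accuracy@5: logits of 7 candidate entities for two queries
logit = torch.tensor([
    [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.0],
])
positive = torch.tensor([3, 6])            # true answer id per query
top5 = logit.topk(5, dim=1).indices        # five best entities per query
hits = (positive.reshape(-1, 1) == top5).any(dim=1)
accuracy_at_5 = hits.float().mean().item()
```

Entity 3 is among the top five for the first query, while entity 6 has the lowest logit for the second one, so exactly one of the two queries counts as a hit.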

In [None]:
from tqdm import tqdm

def train(model, opt, loader):
    model.train()
    loss_log = []
    for idx, i in enumerate(tqdm(loader)):
        anchor1, relation1, anchor2, relation2, neg_pow, positive, negative = [
            j.to(device) for j in i
        ]

        pos, neg = model(anchor1, relation1, anchor2, relation2, neg_pow, positive, negative)
        loss = - F.logsigmoid(pos).mean() - F.logsigmoid(-neg).mean()

        opt.zero_grad()
        loss.backward()
        opt.step()
        loss_log.append(loss.item())
    return loss_log


@torch.no_grad()  # evaluation does not need gradients
def evaluate(model, loader):
    model.eval()
    # ids of all candidate entities, created on the same device as the model
    all_entities = torch.arange(len(ent2id), device=device)
    pos = 0
    total = 0
    for i in tqdm(loader):
        anchor1, relation1, anchor2, relation2, neg_pow, positive, negative = [
            j.to(device) for j in i
        ]
        dists = model.encode_query(anchor1, relation1, anchor2, relation2, neg_pow)
        # one row of candidate ids per query, so any batch size works
        entities = all_entities.repeat(anchor1.shape[0], 1)
        logit = model.calc_logit(entities, dists)
        topk = logit.argsort(descending=True)[:, :5]
        # a query is a hit if its positive answer is among the top 5
        pos += (positive.reshape(-1, 1) == topk).any(dim=1).sum().item()
        total += positive.shape[0]
    return pos / total

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
from matplotlib import pyplot as plt
from IPython.display import clear_output

model = BetaE(24, 800, len(ent2id), len(rel2id), 2)
model = model.to(device)
opt = torch.optim.Adam(model.parameters(), 3e-4)

# train_loss_log = []
# val_loss_log = []
# acc_log = []
# for epoch in range(25):
#     loss = train(model, opt, train_loader)
#     train_loss_log.append(np.mean(loss))
#     acc = evaluate(model, val_loader)
#     acc_log.append(acc)

#     clear_output()
#     plt.plot(train_loss_log, label="train")
#     plt.legend()
#     plt.show()
#     plt.plot(acc_log)
#     plt.show()

# with open("multihop-weights.pth", "wb") as f:
#     torch.save(model.state_dict(), f)

Download your weights here

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
# model.load_state_dict(state_dict)

In [None]:
model.eval()
test_ds = NegationDataset(test_queries[:200], len(ent2id))
test_loader = torch.utils.data.DataLoader(test_ds, batch_size=32, shuffle=True)
score = evaluate(model, test_loader)
print()
print(f"Test accuracy: {score:.4f}")

assert (score > 0.6)

Let us check some predictions

In [None]:
id2ent = {j: i for i, j in ent2id.items()}

dists = model.encode_query(
    torch.LongTensor([ent2id["Turkmenistan"]]).to(device),
    torch.LongTensor([rel2id["shares border with"]]).to(device),
    torch.LongTensor([ent2id["Turkey"]]).to(device),
    torch.LongTensor([rel2id["shares border with"]]).to(device),
    torch.Tensor([-1]).to(device)
)

logit = model.calc_logit(torch.arange(len(ent2id)).reshape(1, -1), dists)
[id2ent[i.item()] for i in logit.argsort(descending=True)[0, :5]]

In [None]:
dists = model.encode_query(
    torch.LongTensor([ent2id["Turkmenistan"]]).to(device),
    torch.LongTensor([rel2id["shares border with"]]).to(device),
    torch.LongTensor([ent2id["Turkey"]]).to(device),
    torch.LongTensor([rel2id["shares border with"]]).to(device),
    torch.Tensor([1]).to(device)
)

logit = model.calc_logit(torch.arange(len(ent2id)).reshape(1, -1), dists)
[id2ent[i.item()] for i in logit.argsort(descending=True)[0, :5]]

In [None]:
turkmenistan_neighbors = set(df[(df.h == ent2id["Turkmenistan"]) & (df.r == rel2id["shares border with"])].t)
turkey_neighbors = set(df[(df.h == ent2id["Turkey"]) & (df.r == rel2id["shares border with"])].t)
print("Neighbors intersection: ", ', '.join([id2ent[i] for i in turkmenistan_neighbors & turkey_neighbors]))
print("Neighbors diff: ", ', '.join([id2ent[i] for i in turkmenistan_neighbors - turkey_neighbors]))

The predictions look reasonably close. Iran is within the top-5 answers for the intersection query, and it does not appear in the top-5 predictions for the negation query.