# **Entity**

Definition: a single meaningful unit of information in the world of human concepts or information.

Example: a Wikipedia page
[Gummy (singer)](https://ko.wikipedia.org/wiki/%EA%B1%B0%EB%AF%B8_(%EA%B0%80%EC%88%98))

![python image2](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbqLemd%2FbtqM2G8JWTi%2FVelDRmXccGW9L0XSGrlRy0%2Fimg.png)

The relations between such entities can be expressed as triples, and a set of triples forms a graph.

**<Entity 1, Relation, Entity 2>**

**<Subject, Relation, Object>**

**<Head, Predicate, Tail>**

**Example:**

**<Gummy, gender, female>**

**<Gummy, occupation, singer>**

**<Gummy, spouse, Jo Jung-suk>**

**...**

![python image2](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FJ1iWc%2FbtqM0q59rTe%2FuK81rXhfWsCLukQXGKIo50%2Fimg.png)
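As a minimal sketch of how such triples can be held in code, the example triples (names romanized) can be stored as plain tuples and grouped into a labeled, directed graph. The variable names and the dictionary layout here are illustrative assumptions, not something prescribed by the text.

```python
from collections import defaultdict

# Example triples stored as (subject, relation, object) tuples.
triples = [
    ("Gummy", "gender", "female"),
    ("Gummy", "occupation", "singer"),
    ("Gummy", "spouse", "Jo Jung-suk"),
]

# Build a directed, labeled graph: subject -> list of (relation, object) edges.
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

print(graph["Gummy"])
```

Each tuple becomes one labeled edge, so the triple list and the graph carry the same information.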

# **TransE**

A graph embedding method (translation-based).

The model learns an embedding vector for every entity and relation so that Subject + Relation ≈ Object holds for the known triples.

![python image2](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2rojl%2FbtqMTgDd9KD%2FwNecWHJXitVQNpT9uKa67k%2Fimg.png)

![python image2](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fn53YR%2FbtqMRkspbpg%2FCcTE6O0f0zGDkuLbUShhZ1%2Fimg.png)
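The translation idea can be checked by hand with a toy example. The sketch below uses hand-picked 2-D vectors (hypothetical values, not learned embeddings) and measures how far Subject + Relation lands from Object with an L1 distance, the same norm used by the model code later in this notebook.

```python
def l1_distance(subject_vec, relation_vec, object_vec):
    """L1 norm of (subject + relation - object); smaller means more plausible."""
    return sum(abs(s + r - o)
               for s, r, o in zip(subject_vec, relation_vec, object_vec))

# Hypothetical 2-D embeddings chosen by hand for illustration.
gummy = [1.0, 0.0]
occupation = [0.0, 2.0]
singer = [1.0, 2.0]   # equals gummy + occupation exactly
female = [3.0, 0.0]   # not reachable through the "occupation" translation

good = l1_distance(gummy, occupation, singer)
bad = l1_distance(gummy, occupation, female)
print(good, bad)
```

A plausible triple gets a distance near zero, while an implausible one gets a larger distance.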

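## Negative Sampling

A method for training the TransE model.

The model is trained so that the score (distance) of a correct (positive) triple becomes lower than that of an incorrect (negative) triple by at least the margin.

loss(x1, x2, y) = max(0, (positive − negative) + margin), where x1 is the positive score, x2 is the negative score, and y = −1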

https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html
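To make the formula concrete, here is a small plain-Python sketch of the same expression for a single pair of scores. The function name is hypothetical; the notebook itself uses `nn.MarginRankingLoss` with y = −1, which reduces to this form.

```python
def margin_ranking_loss(positive, negative, margin=1.0):
    """max(0, (positive - negative) + margin) for distance-like scores."""
    return max(0.0, (positive - negative) + margin)

# The negative triple is already farther than the positive one by more than the margin.
print(margin_ranking_loss(0.5, 3.0))

# The gap (0.5) is smaller than the margin (1.0), so a penalty remains.
print(margin_ranking_loss(2.0, 2.5))
```

The loss is zero once the negative distance exceeds the positive distance by the margin, so well-separated pairs stop contributing gradients.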

![python image2](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcLC1V5%2FbtqMThoFHpW%2FjqkGbCwfTlsubfYo227XA1%2Fimg.png)
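One common way to obtain negative triples, and the one the training loop in this notebook follows, is to replace either the subject or the object of a positive triple with a random entity. The sketch below shows the idea for a single triple in plain Python; the helper name `corrupt` and the use of `random.Random` are assumptions made for illustration.

```python
import random

def corrupt(triple, n_entity, rng):
    """Return a negative triple by replacing the subject or the object at random."""
    subject, relation, obj = triple
    random_entity = rng.randrange(n_entity)
    if rng.random() < 0.5:
        return (random_entity, relation, obj)   # corrupt the head
    return (subject, relation, random_entity)   # corrupt the tail

rng = random.Random(0)
positive = (0, 5, 7)   # (subject_id, relation_id, object_id)
negative = corrupt(positive, n_entity=100, rng=rng)
print(positive, negative)
```

Note that this simple scheme does not check whether the sampled triple happens to be a true one; it only guarantees that the relation and one side of the triple are kept.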

# **Link Prediction**

Given <Subject, Relation, X> or <X, Relation, Object>, the task is to find the entity that belongs in position X.
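With a translation-based model, this amounts to scoring every entity as a candidate for X and ranking the candidates by distance. The following sketch does this for the <Subject, Relation, X> case using hypothetical hand-picked vectors; the function name and dictionaries are illustrative assumptions.

```python
def predict_object(subject, relation, entity_vecs, relation_vecs, k=2):
    """Rank every entity as a candidate object by L1 distance (smaller is better)."""
    s, r = entity_vecs[subject], relation_vecs[relation]

    def distance(candidate):
        o = entity_vecs[candidate]
        return sum(abs(si + ri - oi) for si, ri, oi in zip(s, r, o))

    return sorted(entity_vecs, key=distance)[:k]

# Hypothetical 2-D embeddings chosen by hand for illustration.
entity_vecs = {"Gummy": [1.0, 0.0], "singer": [1.0, 2.0], "female": [3.0, 0.0]}
relation_vecs = {"occupation": [0.0, 2.0]}

top = predict_object("Gummy", "occupation", entity_vecs, relation_vecs)
print(top)
```

The <X, Relation, Object> case works the same way, with the candidate placed in the subject position instead.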

## Benchmark Dataset

FB15K dataset

A dataset built from Freebase.

Entity ID: /m/027rn

Wikidata: https://www.wikidata.org/wiki/Q786

Wikipedia: https://ko.wikipedia.org/wiki/%EB%8F%84%EB%AF%B8%EB%8B%88%EC%B9%B4_%EA%B3%B5%ED%99%94%EA%B5%AD

Query search: https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0APREFIX%20wikibase%3A%20%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0A%0ASELECT%20%20%3Fs%20%3FsLabel%20%3Fp%20%3Fo%20%3FoLabel%20WHERE%20%7B%0A%20%3Fs%20%3Fp%20%3Fo%20.%0A%20%3Fs%20wdt%3AP646%20%22%2Fm%2F027rn%22%20.%0A%0A%20%20%20SERVICE%20wikibase%3Alabel%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%0A%20%20%20%7D%0A%20%7D
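The query link above carries a percent-encoded SPARQL query in its URL fragment. As a small sketch, the line that ties the Freebase ID to a Wikidata item can be decoded with the standard library; only that one line of the fragment is reproduced here, and the variable names are illustrative.

```python
from urllib.parse import unquote

# One percent-encoded line taken from the query link's URL fragment.
encoded_line = "%3Fs%20wdt%3AP646%20%22%2Fm%2F027rn%22%20."

# Decode it back into readable SPARQL.
decoded = unquote(encoded_line)
print(decoded)
```

The decoded pattern matches items whose `wdt:P646` value (Wikidata's Freebase ID property) equals `/m/027rn`.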

### Imports and hyperparameters

In [16]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from tqdm import tqdm
from torch.utils import data

model_path = '/train_model/best_model'
data_path = '/data/FB15K'
train_batch_size = 1024
eval_batch_size = 512
epochs = 100 # 2000
learning_rate = 0.1

hidden_size = 50
eval_freq = 25
margin = 1.
seed = 3435

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

### Data loader

In [6]:
def load_dict(path) :
    output_dict = {}

    with open(path, 'r') as f:
        for line in f.readlines()[1:] :  # skip the first (header) line
            key, value = line.strip().split('\t')
            output_dict[key] = int(value)

    return output_dict

class FB15K_Dataset(data.Dataset) :
    def __init__(self, path, entity2id, relation2id):
        self.entity2id = entity2id
        self.relation2id = relation2id

        self.data = []
        with open(path, 'r') as f:
            for line in f.readlines()[1:]:  # skip the first (header) line
                # str -> int (sbj_id, obj_id, rel_id)
                self.data.append(list(map(int, line.strip().split())))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]
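The loader above implies a particular file layout: a first line that is skipped, followed by one whitespace-separated integer triple per line in (subject, object, relation) order. The sketch below parses a tiny in-memory sample under that assumption; the sample contents are hypothetical and are not taken from the real FB15K files.

```python
import io

# Hypothetical file contents: a header line, then one
# "subject_id object_id relation_id" triple per line.
sample = io.StringIO("2\n0 1 0\n2 3 1\n")

# Same parsing logic as the dataset class: skip the header, convert str -> int.
rows = [list(map(int, line.strip().split())) for line in sample.readlines()[1:]]
print(rows)

# The columns are (subject, object, relation), so reorder when building a triple.
sbj, obj, rel = rows[0]
triple = (sbj, rel, obj)
print(triple)
```

This column order is why the training and evaluation loops unpack each batch as `sbj, obj, rel` before stacking it into (subject, relation, object).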

### Creating the data

In [8]:
# 1. Load data set
entity2id = load_dict(os.path.join(data_path, 'entity2id.txt'))
relation2id = load_dict(os.path.join(data_path, 'relation2id.txt'))

train_dataset = FB15K_Dataset(os.path.join(data_path, 'train2id.txt'), entity2id, relation2id)
dev_dataset = FB15K_Dataset(os.path.join(data_path, 'valid2id.txt'), entity2id, relation2id)
test_dataset = FB15K_Dataset(os.path.join(data_path, 'test2id.txt'), entity2id, relation2id)

train_loader = data.DataLoader(train_dataset,
                               batch_size=train_batch_size,
                               shuffle=True)
dev_loader = data.DataLoader(dev_dataset,
                             batch_size=eval_batch_size,
                             shuffle=False)
test_loader = data.DataLoader(test_dataset,
                              batch_size=eval_batch_size,
                              shuffle=False)

print('entity to id : ', [(key, val) for key, val in entity2id.items()][:5])
print('relation to id : ', [(key, val) for key, val in relation2id.items()][:5])

entity to id :  [('/m/027rn', 0), ('/m/06cx9', 1), ('/m/017dcd', 2), ('/m/06v8s0', 3), ('/m/07s9rl0', 4)]
relation to id :  [('/location/country/form_of_government', 0), ('/tv/tv_program/regular_cast./tv/regular_tv_appearance/actor', 1), ('/media_common/netflix_genre/titles', 2), ('/award/award_winner/awards_won./award/award_honor/award_winner', 3), ('/soccer/football_team/current_roster./sports/sports_team_roster/position', 4)]


### Model definition

In [9]:

class TransE(nn.Module):
    def __init__(self, n_entity, n_relation, hidden_size, margin=1.0, device='cpu'):
        super(TransE, self).__init__()
        self.device = device

        self.n_entity = n_entity
        self.n_relation = n_relation
        self.hidden_size = hidden_size

        self.entity_embedding = nn.Embedding(self.n_entity + 1, self.hidden_size, padding_idx=self.n_entity)
        self.relation_embedding = nn.Embedding(self.n_relation + 1, self.hidden_size, padding_idx=self.n_relation)

        self.init_weight(self.entity_embedding)
        self.init_weight(self.relation_embedding)

        self.loss_func = nn.MarginRankingLoss(margin=margin, reduction='none')

    def init_weight(self, embedding):
        n_vocab, hidden_dim = embedding.weight.data.size()
        sqrt_dim = hidden_dim ** 0.5

        embedding.weight.data = torch.FloatTensor(n_vocab, hidden_dim).uniform_(-6./sqrt_dim, 6./sqrt_dim)
        embedding.weight.data = F.normalize(embedding.weight.data, 2, 1)

    def get_score(self, triple):
        sbj, rel, obj = triple[:, 0], triple[:, 1], triple[:, 2]

        sbj_embedding = self.entity_embedding(sbj)
        rel_embedding = self.relation_embedding(rel)
        obj_embedding = self.entity_embedding(obj)

        score = torch.norm((sbj_embedding + rel_embedding - obj_embedding), p=1, dim=1)

        return score

    def forward(self, positive_triple, negative_triple):
        positive_score = self.get_score(positive_triple)
        negative_score = self.get_score(negative_triple)

        y = torch.tensor([-1.], dtype=torch.float, device=self.device)

        return self.loss_func(positive_score, negative_score, y)

### Model evaluation

In [12]:
def hit_at_k(pred, answer, device, k=10) :
    zero_tensor = torch.tensor([0], device=device)
    one_tensor = torch.tensor([1], device=device)

    _, indices = pred.topk(k=k, largest=False)

    return torch.where(indices == answer.unsqueeze(1), one_tensor, zero_tensor).sum().item()

def MRR(pred, answer) :
    return (1. / (pred.argsort() == answer.unsqueeze(1)).nonzero()[:, 1].float().add(1.)).sum().item()

@torch.no_grad()  # ranking all entities needs no gradients; avoids building a large autograd graph
def evaluation(model, data_loader, device) :
    model.eval() # evaluation mode
    hit_at_1, hit_at_3, hit_at_10, mrr, total = 0., 0., 0., 0., 0.

    entity_ids = torch.arange(model.n_entity, device=device).unsqueeze(0)

    for sbj, obj, rel in data_loader :
        sbj, rel, obj = sbj.to(device), rel.to(device), obj.to(device)  # to GPU
        b_size = sbj.size(0)

        all_entity = entity_ids.repeat(b_size, 1)
        repeat_sbj = sbj.unsqueeze(1).repeat(1, all_entity.size(1))
        repeat_rel = rel.unsqueeze(1).repeat(1, all_entity.size(1))
        repeat_obj = obj.unsqueeze(1).repeat(1, all_entity.size(1))

        sbj_triples = torch.stack((repeat_sbj, repeat_rel, all_entity), dim=2).view(-1, 3)
        obj_triples = torch.stack((all_entity, repeat_rel, repeat_obj), dim=2).view(-1, 3)

        obj_pred_score = model.get_score(sbj_triples).view(b_size, -1)
        sbj_pred_score = model.get_score(obj_triples).view(b_size, -1)

        pred = torch.cat([sbj_pred_score, obj_pred_score], dim=0)
        answer = torch.cat([sbj, obj], dim=0)

        hit_at_1 += hit_at_k(pred, answer, device, k=1)
        hit_at_3 += hit_at_k(pred, answer, device, k=3)
        hit_at_10 += hit_at_k(pred, answer, device, k=10)

        mrr += MRR(pred, answer)
        total += pred.size(0)

    hit_at_1_score = hit_at_1 / total * 100.
    hit_at_3_score = hit_at_3 / total * 100.
    hit_at_10_score = hit_at_10 / total * 100.
    mrr_score = mrr / total * 100.

    return hit_at_1_score, hit_at_3_score, hit_at_10_score, mrr_score
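To see what the two metrics measure, here is a plain-Python sketch on two toy queries. It ranks candidates by ascending distance, as the evaluation code does, then computes Hits@1 and MRR as fractions rather than percentages. The helper name and the toy scores are assumptions for illustration.

```python
def rank_of(scores, answer):
    """1-based rank of `answer` when candidates are sorted by ascending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order.index(answer) + 1

# Two toy queries over four candidate entities (distances, smaller is better).
queries = [
    ([0.9, 0.1, 0.5, 0.7], 1),  # correct entity has the smallest distance
    ([0.2, 0.8, 0.4, 0.6], 2),  # correct entity has the second smallest distance
]

ranks = [rank_of(scores, answer) for scores, answer in queries]
hits_at_1 = sum(r <= 1 for r in ranks) / len(ranks)
mrr = sum(1.0 / r for r in ranks) / len(ranks)
print(ranks, hits_at_1, mrr)
```

Hits@k only counts whether the answer is within the top k, while MRR also rewards answers that rank just outside the top position.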

### Model creation

In [20]:
# 2. Model
model = TransE(n_entity=len(entity2id),
                n_relation=len(relation2id),
                hidden_size=hidden_size,
                margin=margin,
                device=device)
model.to(device) # to GPU
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

print('Model Structure : {}'.format(model))

Model Structure : TransE(
  (entity_embedding): Embedding(14952, 50, padding_idx=14951)
  (relation_embedding): Embedding(1346, 50, padding_idx=1345)
  (loss_func): MarginRankingLoss()
)


### Training and evaluation

In [21]:
# 3. Training
best_score = 0.
for epoch in range(1, epochs+1) :
    model.train() # train mode
    for i, (sbj, obj, rel) in enumerate(train_loader) :
        sbj, rel, obj = sbj.to(device), rel.to(device), obj.to(device)  # to GPU

        positive_triples = torch.stack((sbj, rel, obj), dim=1) # (batch) * 3 -> (batch, 3)

        # Negative sampling
        head_or_tail = torch.randint(high=2, size=sbj.size(), device=device)
        random_entities = torch.randint(high=len(entity2id), size=sbj.size(), device=device)
        neg_sbj = torch.where(head_or_tail == 1, random_entities, sbj)
        neg_obj = torch.where(head_or_tail == 0, random_entities, obj)
        negative_triples = torch.stack((neg_sbj, rel, neg_obj), dim=1) # (batch) * 3 -> (batch, 3)

        loss = model(positive_triples, negative_triples).mean()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print('Epoch = {}, loss = {:.6f}'.format(epoch, loss.item()))  # loss of the last batch
    if(epoch % eval_freq == 0) :
        print('evaluation...')
        hit_at_1_score, hit_at_3_score, hit_at_10_score, mrr_score = evaluation(model, dev_loader, device)

        print('Dev set >> hit@1 : {:.2f}, hit@3 : {:.2f}, hit@10 : {:.2f}, mrr : {:.2f}'.format(hit_at_1_score,
                                                                                                hit_at_3_score,
                                                                                                hit_at_10_score,
                                                                                                mrr_score))
        if(hit_at_10_score > best_score) :
            print('best model save...')
            best_score = hit_at_10_score  # remember the best dev hit@10 so far
            state_dict = model.state_dict()
            torch.save(state_dict, model_path)

model.load_state_dict(torch.load(model_path))
hit_at_1_score, hit_at_3_score, hit_at_10_score, mrr_score = evaluation(model, test_loader, device)

print('Test Set >> hit@1 : {:.2f}, hit@3 : {:.2f}, hit@10 : {:.2f}, mrr : {:.2f}'.format(hit_at_1_score,
                                                                                         hit_at_3_score,
                                                                                         hit_at_10_score,
                                                                                         mrr_score))

Epoch = 1, loss = 0.880901
Epoch = 2, loss = 0.906595
Epoch = 3, loss = 0.850646
Epoch = 4, loss = 0.873005
Epoch = 5, loss = 0.835165
Epoch = 6, loss = 0.783515
Epoch = 7, loss = 0.746253
Epoch = 8, loss = 0.748630
Epoch = 9, loss = 0.684783
Epoch = 10, loss = 0.699624
Epoch = 11, loss = 0.691527
Epoch = 12, loss = 0.608379
Epoch = 13, loss = 0.624080
Epoch = 14, loss = 0.607351
Epoch = 15, loss = 0.589332
Epoch = 16, loss = 0.608591
Epoch = 17, loss = 0.594701
Epoch = 18, loss = 0.575374
Epoch = 19, loss = 0.577998
Epoch = 20, loss = 0.566708
Epoch = 21, loss = 0.547071
Epoch = 22, loss = 0.528131
Epoch = 23, loss = 0.468423
Epoch = 24, loss = 0.500543
Epoch = 25, loss = 0.501337
evaluation...
Dev set >> hit@1 : 1.06, hit@3 : 3.84, hit@10 : 7.92, mrr : 3.43
best model save...
Epoch = 26, loss = 0.458003
Epoch = 27, loss = 0.443243
Epoch = 28, loss = 0.476310
Epoch = 29, loss = 0.450943
Epoch = 30, loss = 0.396407
Epoch = 31, loss = 0.472788
Epoch = 32, loss = 0.438596
Epoch = 33, los