# 2020语言与智能技术竞赛：关系抽取任务
https://aistudio.baidu.com/aistudio/competition/detail/31

关系抽取 (Relation Extraction, RE) 是从自然语言文本中抽取实体及其之间关系的信息技术，是信息检索、智能问答、智能对话等人工智能应用的重要基础，一直受到业界的广泛关注。关系抽取任务涉及命名实体识别、指代消解、关系分类等复杂技术，极具挑战性。

本次竞赛在去年信息抽取竞赛的基础上进行了升级。整体的任务形式仍然是 schema 约束下的关系抽取，也就是在给定关系集合下，从自然语言文本中抽取出符合关系 schema 约束的 SPO 三元组知识，但是对 O 值形态进行了复杂化的扩展。相信这会给参赛者带来更大的挑战和乐趣。

除此之外，本次竞赛将提供业界规模最大的中文关系抽取数据集 DuIE 2.0，旨在为研究者提供学术交流平台，进一步提升中文关系抽取技术的研究水平，推动相关人工智能应用的发展。
## Agenda

    2020/3/10 	启动竞赛报名，发放样例数据
    2020/3/31 	开放评测入口和排行榜，对报名者发放全部训练数据和第一批测试数据
    2020/5/12 	报名截止
    2020/5/13 	发放最终测试数据
    2020/5/20 	系统结果提交截止
    2020/5/30 	公布竞赛结果，接收系统报告和论文
    2020/6/30 	论文提交截止日期
    2020/7 	在“语言与智能高峰论坛”上交流和颁奖

## 数据介绍 Data

本次竞赛数据集共包含 48个已定义好的schema和超过21万中文句子，其中包括17万训练集，2万验证集和2万测试集，共分为以下5个部分：

1.训练集：共17万个句子，包含句子中对应的SPO，用于竞赛模型训练。

2.验证集：共2万个句子，包含句子中对应的SPO，用于竞赛模型训练和参数调试。

3.schema约束：共48个限定的schema，定义了关系P以及其对应的主体S和客体O的类别。

4.测试集1：约1万个句子，不包含句子中对应的SPO，用于参赛者在平台上自助提交模型预测结果、验证效果。

5.测试集2:本次竞赛最终测试集，约2万个句子，不包含句子对应的SPO，包含测试集1。另外为了防止针对测试集的调试，数据中将会额外加入混淆数据。该部分数据在评测结束前一周发布，结果不能在平台上自助验证，由评测委员会进行离线评测。
### 样例数据 Data Sample

平台提供的数据为JSON文件格式，样例如下:

    {
        "text":"王雪纯是87版《红楼梦》中晴雯的配音者，她是中央台《正大综艺》的主持人",
        "spo_list":[
            {
                "predicate":"配音",
                "subject_type":"娱乐人物",
                "object":{
                    "@value":"晴雯",
                    "inWork":"红楼梦"
                },
                "object_type":{
                    "@value":"人物",
                    "inWork":"影视作品"
                },
                "subject":"王雪纯"
            },
            {
                "predicate":"主持人",
                "subject_type":"电视综艺",
                "object":{
                    "@value":"王雪纯"
                },
                "object_type":{
                    "@value":"人物"
                },
                "subject":"正大综艺"
            }
        ]
    }

更多样例和详细数据格式说明参见数据集中包含的“数据格式说明”文档。
Please refer to the specification in dataset package for details.
### 入门参考

1.基线系统：一个开源的基于schema的信息抽取基线系统，将在3月31日前在比赛网站上发布。
Baseline System: An open sourced baseline for information extraction will be released on March 31st.

2.飞桨使用教程：请参考 飞桨官网
PaddlePaddle tutorial: please refer to the PaddlePaddle website

# baseline by苏剑林
* 百度LIC2020的关系抽取赛道，非官方baseline
* 基于“半指针-半标注”结构
* 文章介绍：https://kexue.fm/archives/7161
* 在第一期测试集上能达到0.68的F1，略低于官方baseline

In [1]:
import json
import numpy as np
from bert4keras.backend import keras, K, batch_gather
from bert4keras.layers import LayerNormalization
from bert4keras.tokenizers import Tokenizer
from bert4keras.models import build_transformer_model
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open, groupby
from keras.layers import Input, Dense, Lambda, Reshape
from keras.models import Model
from tqdm import tqdm

# 基本信息
maxlen = 256
epochs = 20
batch_size = 16
learning_rate = 2e-5

Using TensorFlow backend.


In [2]:
bert_dir = '/Users/luoyonggui/Documents/nlpdata/chinese_roberta_wwm_ext_L-12_H-768_A-12'
config_path = f'{bert_dir}/bert_config.json'
checkpoint_path = f'{bert_dir}/bert_model.ckpt'
dict_path = f'{bert_dir}/vocab.txt'

In [3]:
def load_data(filename):
    D = []
    with open(filename) as f:
        for l in f:
            l = json.loads(l)
            d = {'text': l['text'], 'spo_list': []}
            for spo in l['spo_list']:
                for k, v in spo['object'].items():
                    d['spo_list'].append(
                        (spo['subject'], spo['predicate'] + '_' + k, v)
                    )
            D.append(d)
    return D


# 加载数据集
train_data = load_data('data_origin/train_data/train_data.json')
valid_data = load_data('data_origin/dev_data/dev_data.json')

In [4]:
# 读取schema
with open('data_origin/schema.json') as f:
    id2predicate, predicate2id, n = {}, {}, 0
    predicate2type = {}
    for l in f:
        l = json.loads(l)
        predicate2type[l['predicate']] = (l['subject_type'], l['object_type'])
        for k, _ in sorted(l['object_type'].items()):
            key = l['predicate'] + '_' + k
            id2predicate[n] = key
            predicate2id[key] = n
            n += 1

In [5]:
# 建立分词器
tokenizer = Tokenizer(dict_path, do_lower_case=True)


def search(pattern, sequence):
    """从sequence中寻找子串pattern
    如果找到，返回第一个下标；否则返回-1。
    """
    n = len(pattern)
    for i in range(len(sequence)):
        if sequence[i:i + n] == pattern:
            return i
    return -1


class data_generator(DataGenerator):
    """数据生成器
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids = [], []
        batch_subject_labels, batch_subject_ids, batch_object_labels = [], [], []
        for is_end, d in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(
                d['text'], max_length=maxlen
            )
            # 整理三元组 {s: [(o, p)]}
            spoes = {}
            for s, p, o in d['spo_list']:
                s = tokenizer.encode(s)[0][1:-1]
                p = predicate2id[p]
                o = tokenizer.encode(o)[0][1:-1]
                s_idx = search(s, token_ids)
                o_idx = search(o, token_ids)
                if s_idx != -1 and o_idx != -1:
                    s = (s_idx, s_idx + len(s) - 1)
                    o = (o_idx, o_idx + len(o) - 1, p)
                    if s not in spoes:
                        spoes[s] = []
                    spoes[s].append(o)
            if spoes:
                # subject标签
                subject_labels = np.zeros((len(token_ids), 2))
                for s in spoes:
                    subject_labels[s[0], 0] = 1
                    subject_labels[s[1], 1] = 1
                # 随机选一个subject
                start, end = np.array(list(spoes.keys())).T
                start = np.random.choice(start)
                end = np.random.choice(end[end >= start])
                subject_ids = (start, end)
                # 对应的object标签
                object_labels = np.zeros((len(token_ids), len(predicate2id), 2))
                for o in spoes.get(subject_ids, []):
                    object_labels[o[0], o[2], 0] = 1
                    object_labels[o[1], o[2], 1] = 1
                # 构建batch
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_subject_labels.append(subject_labels)
                batch_subject_ids.append(subject_ids)
                batch_object_labels.append(object_labels)
                if len(batch_token_ids) == self.batch_size or is_end:
                    batch_token_ids = sequence_padding(batch_token_ids)
                    batch_segment_ids = sequence_padding(batch_segment_ids)
                    batch_subject_labels = sequence_padding(
                        batch_subject_labels, padding=np.zeros(2)
                    )
                    batch_subject_ids = np.array(batch_subject_ids)
                    batch_object_labels = sequence_padding(
                        batch_object_labels,
                        padding=np.zeros((len(predicate2id), 2))
                    )
                    yield [
                        batch_token_ids, batch_segment_ids,
                        batch_subject_labels, batch_subject_ids,
                        batch_object_labels
                    ], None
                    batch_token_ids, batch_segment_ids = [], []
                    batch_subject_labels, batch_subject_ids, batch_object_labels = [], [], []


def extrac_subject(inputs):
    """根据subject_ids从output中取出subject的向量表征
    """
    output, subject_ids = inputs
    subject_ids = K.cast(subject_ids, 'int32')
    start = batch_gather(output, subject_ids[:, :1])
    end = batch_gather(output, subject_ids[:, 1:])
    subject = K.concatenate([start, end], 2)
    return subject[:, 0]


# 补充输入
subject_labels = Input(shape=(None, 2), name='Subject-Labels')
subject_ids = Input(shape=(2,), name='Subject-Ids')
object_labels = Input(shape=(None, len(predicate2id), 2), name='Object-Labels')

# 加载预训练模型
bert = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    return_keras_model=False,
)

# 预测subject
output = Dense(
    units=2, activation='sigmoid', kernel_initializer=bert.initializer
)(bert.model.output)
subject_preds = Lambda(lambda x: x**2)(output)

subject_model = Model(bert.model.inputs, subject_preds)

# 传入subject，预测object
# 通过Conditional Layer Normalization将subject融入到object的预测中
output = bert.model.layers[-2].get_output_at(-1)
subject = Lambda(extrac_subject)([output, subject_ids])
output = LayerNormalization(conditional=True)([output, subject])
output = Dense(
    units=len(predicate2id) * 2,
    activation='sigmoid',
    kernel_initializer=bert.initializer
)(output)
output = Lambda(lambda x: x**4)(output)
object_preds = Reshape((-1, len(predicate2id), 2))(output)

object_model = Model(bert.model.inputs + [subject_ids], object_preds)

# 训练模型
train_model = Model(
    bert.model.inputs + [subject_labels, subject_ids, object_labels],
    [subject_preds, object_preds]
)
train_model.summary()

mask = bert.model.get_layer('Embedding-Token').output_mask
mask = K.cast(mask, K.floatx())

subject_loss = K.binary_crossentropy(subject_labels, subject_preds)
subject_loss = K.mean(subject_loss, 2)
subject_loss = K.sum(subject_loss * mask) / K.sum(mask)

object_loss = K.binary_crossentropy(object_labels, object_preds)
object_loss = K.sum(K.mean(object_loss, 3), 2)
object_loss = K.sum(object_loss * mask) / K.sum(mask)

train_model.add_loss(subject_loss + object_loss)

optimizer = Adam(learning_rate)
train_model.compile(optimizer=optimizer)


def extract_spoes(text):
    """抽取输入text所包含的三元组
    """
    tokens = tokenizer.tokenize(text, max_length=maxlen)
    mapping = tokenizer.rematch(text, tokens)
    token_ids, segment_ids = tokenizer.encode(text, max_length=maxlen)
    # 抽取subject
    subject_preds = subject_model.predict([[token_ids], [segment_ids]])
    start = np.where(subject_preds[0, :, 0] > 0.4)[0]
    end = np.where(subject_preds[0, :, 1] > 0.4)[0]
    subjects = []
    for i in start:
        j = end[end >= i]
        if len(j) > 0:
            j = j[0]
            subjects.append((i, j))
    if subjects:
        spoes = []
        token_ids = np.repeat([token_ids], len(subjects), 0)
        segment_ids = np.repeat([segment_ids], len(subjects), 0)
        subjects = np.array(subjects)
        # 传入subject，抽取object和predicate
        object_preds = object_model.predict([token_ids, segment_ids, subjects])
        for subject, object_pred in zip(subjects, object_preds):
            start = np.where(object_pred[:, :, 0] > 0.4)
            end = np.where(object_pred[:, :, 1] > 0.4)
            for _start, predicate1 in zip(*start):
                for _end, predicate2 in zip(*end):
                    if _start <= _end and predicate1 == predicate2:
                        spoes.append(
                            ((mapping[subject[0]][0],
                              mapping[subject[1]][-1]), predicate1,
                             (mapping[_start][0], mapping[_end][-1]))
                        )
                        break
        return [(text[s[0]:s[1] + 1], id2predicate[p], text[o[0]:o[1] + 1])
                for s, p, o, in spoes]
    else:
        return []


def combine_spoes(spoes):
    """合并SPO成官方格式
    """
    new_spoes = {}
    for s, p, o in spoes:
        p1, p2 = p.split('_')
        if (s, p1) in new_spoes:
            new_spoes[(s, p1)][p2] = o
        else:
            new_spoes[(s, p1)] = {p2: o}

    return [(k[0], k[1], v) for k, v in new_spoes.items()]


class SPO(tuple):
    """用来存三元组的类
    表现跟tuple基本一致，只是重写了 __hash__ 和 __eq__ 方法，
    使得在判断两个三元组是否等价时容错性更好。
    """
    def __init__(self, spo):
        self.spox = (
            tuple(tokenizer.tokenize(spo[0])),
            spo[1],
            tuple(
                sorted([
                    (k, tuple(tokenizer.tokenize(v))) for k, v in spo[2].items()
                ])
            ),
        )

    def __hash__(self):
        return self.spox.__hash__()

    def __eq__(self, spo):
        return self.spox == spo.spox


def evaluate(data):
    """评估函数，计算f1、precision、recall
    """
    X, Y, Z = 1e-10, 1e-10, 1e-10
    f = open('dev_pred.json', 'w', encoding='utf-8')
    pbar = tqdm()
    for d in data:
        R = combine_spoes(extract_spoes(d['text']))
        T = combine_spoes(d['spo_list'])
        R = set([SPO(spo) for spo in R])
        T = set([SPO(spo) for spo in T])
        X += len(R & T)
        Y += len(R)
        Z += len(T)
        f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
        pbar.update()
        pbar.set_description(
            'f1: %.5f, precision: %.5f, recall: %.5f' % (f1, precision, recall)
        )
        s = json.dumps({
            'text': d['text'],
            'spo_list': list(T),
            'spo_list_pred': list(R),
            'new': list(R - T),
            'lack': list(T - R),
        },
                       ensure_ascii=False,
                       indent=4)
        f.write(s + '\n')
    pbar.close()
    f.close()
    return f1, precision, recall


def predict_to_file(in_file, out_file):
    """预测结果到文件，方便提交
    """
    fw = open(out_file, 'w', encoding='utf-8')
    with open(in_file) as fr:
        for l in tqdm(fr):
            l = json.loads(l)
            spoes = combine_spoes(extract_spoes(l['text']))
            spoes = [{
                'subject': spo[0],
                'subject_type': predicate2type[spo[1]][0],
                'predicate': spo[1],
                'object': spo[2],
                'object_type': {
                    k: predicate2type[spo[1]][1][k]
                    for k in spo[2]
                }
            }
                     for spo in spoes]
            l['spo_list'] = spoes
            s = json.dumps(l, ensure_ascii=False)
            fw.write(s + '\n')
    fw.close()


class Evaluator(keras.callbacks.Callback):
    """评估和保存模型
    """
    def __init__(self):
        self.best_val_f1 = 0.

    def on_epoch_end(self, epoch, logs=None):
        f1, precision, recall = evaluate(valid_data)
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            train_model.save_weights('best_model.weights')
        print(
            'f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Embedding-Token (Embedding)     (None, None, 768)    16226304    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, None, 768)    1536        Input-Segment[0][0]              
____________________________________________________________________________________________

  'be expecting any data to be passed to {0}.'.format(name))
  'be expecting any data to be passed to {0}.'.format(name))


In [6]:
train_generator = data_generator(train_data, batch_size)
evaluator = Evaluator()

In [7]:
train_model.fit_generator(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    callbacks=[evaluator]
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/20


ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (55,2) and requested shape (3,2)

In [None]:
# train_model.load_weights('best_model.weights')
predict_to_file('data_origin/test1_data/test1_data.json', 'ie_pred.json')
