# Introduction

https://aistudio.baidu.com/aistudio/competition/detail/32

https://kexue.fm/archives/7321

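Event extraction (EE) is the task of extracting events from natural-language text and identifying their event types and event arguments. It underpins AI applications such as intelligent risk control, intelligent investment research, and public-opinion monitoring, and has drawn wide attention from academia and industry. The task combines several hard sub-problems: event sentence extraction, trigger word identification, event type classification, and argument extraction.

This competition provides DuEE, the largest Chinese event extraction dataset in the industry. It aims to give researchers a platform for academic exchange, raise the level of research on event extraction, and promote related AI applications.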

## Agenda

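    2020/3/10   Registration opens; sample data released
    2020/3/31   Evaluation portal and leaderboard open; full training data and the first batch of test data released to registrants
    2020/5/12   Registration closes
    2020/5/13   Final test data released
    2020/5/20   Deadline for submitting system results
    2020/5/30   Results announced; system reports and papers accepted
    2020/6/30   Paper submission deadline
    2020/7      Presentations and awards at the "语言与智能高峰论坛" (Language and Intelligence Summit)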

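## Data

The dataset contains 65 predefined event type constraints and about 17,000 Chinese sentences: 12,000 for training, 1,500 for validation, and 3,500 for testing. It is divided into the following five parts:

1. Training set: 12,000 sentences, annotated with the event types, arguments, and argument roles found in each sentence; used for model training.

2. Validation set: 1,500 sentences with the same annotations; used for model training and hyperparameter tuning.

3. Event type constraints: 65 event types and their 121 argument role categories.

4. Test set 1: about 1,500 sentences without event type, argument, or role annotations; participants use it to submit predictions on the platform and check their results.

5. Test set 2: the final test set, about 2,000 sentences without event type, argument, or role annotations. To prevent tuning against the test set, extra distractor data will be mixed in. This set is used for the final system evaluation.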

### Sample Data

The platform provides the data in JSON format, one JSON object per line. A single record looks like this (the `//` comments are explanatory and not part of the data):

    {
        "text":"泰安今早发生2.9级地震！靠近这个国家森林公园", // input sentence
        "id":"15eb3f6208c67a0081164fd18ea5674c",   // input sentence id 
        "event_list":[     // all events from text
            {
                "arguments":[   // all event arguments with the event type
                    {
                        "argument_start_index":2, // event argument start position
                        "role":"时间",  // event argument role
                        "argument":"今早", // event argument mentioned in text
                        "alias":[  // other event argument mentions
                        ]
                    }
                ],
                "trigger":"地震",  // trigger word
                "trigger_start_index":10,  // trigger word start position
                "class":"灾害/意外",  // event class
                "event_type":"灾害/意外-地震" // event type 
            }
        ]
    }
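
The offsets in the sample are character positions in `text`. The following is a minimal sketch, assuming each record sits on one line as plain JSON without the `//` comments; the `line` string is reconstructed from the sample above and `span_matches` is a helper introduced here for illustration.

```python
import json

# One record as it would appear in the data file: a single line of JSON,
# without the explanatory // comments shown in the sample.
line = (
    '{"text": "泰安今早发生2.9级地震！靠近这个国家森林公园", '
    '"id": "15eb3f6208c67a0081164fd18ea5674c", '
    '"event_list": [{"arguments": [{"argument_start_index": 2, "role": "时间", '
    '"argument": "今早", "alias": []}], "trigger": "地震", '
    '"trigger_start_index": 10, "class": "灾害/意外", '
    '"event_type": "灾害/意外-地震"}]}'
)

record = json.loads(line)
text = record["text"]


def span_matches(text, start, mention):
    """Return True if `mention` occurs in `text` at character offset `start`."""
    return text[start:start + len(mention)] == mention


# Check the trigger and every argument against the sentence.
checks = []
for event in record["event_list"]:
    checks.append(span_matches(text, event["trigger_start_index"], event["trigger"]))
    for arg in event["arguments"]:
        checks.append(span_matches(text, arg["argument_start_index"], arg["argument"]))

print(all(checks))
```

A check like this is a cheap way to catch offset or encoding problems before training.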


In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)  # raise the column limit so wide frames are not cut off with an ellipsis
pd.set_option('expand_frame_repr', False)  # do not wrap wide frames across lines

import seaborn as sns
sns.set(font='Arial Unicode MS')  # a font with CJK glyphs, so Chinese text renders in Seaborn plots
import sys
sys.path.append('/Users/luoyonggui/PycharmProjects/mayiutils_n1/mayiutils/data_prepare')
from data_explore import DataExplore as de



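# Baseline by 苏剑林
* Unofficial baseline for the event extraction track of Baidu LIC2020
* Uses RoBERTa + CRF directly
* Reaches an F1 of 0.78 on the first-phase test set, better than the official baseline

## Submission results
### Trained for 7 epochs, about 3 h per epoch
2020-04-15 11:32  Rank 75/947  
Precision 0.803, recall 0.773, F1 0.788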
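
As a quick sanity check, the three leaderboard figures agree with the usual definition of F1 as the harmonic mean of precision and recall. The helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


leaderboard_f1 = f1_score(0.803, 0.773)
print(round(leaderboard_f1, 3))  # 0.788
```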

In [1]:
import json
import numpy as np
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from bert4keras.layers import ConditionalRandomField
from keras.layers import Dense
from keras.models import Model
from tqdm import tqdm
import pylcs

Using TensorFlow backend.


In [2]:
bert_dir = '/Users/luoyonggui/Documents/nlpdata/chinese_roberta_wwm_ext_L-12_H-768_A-12'
config_path = f'{bert_dir}/bert_config.json'
checkpoint_path = f'{bert_dir}/bert_model.ckpt'
dict_path = f'{bert_dir}/vocab.txt'

In [3]:
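def load_data(filename):
    """Read one JSON record per line and return [(text, arguments), ...].

    `arguments` maps each argument string to its (event_type, role) pair, so
    an argument string that occurs more than once in a sentence keeps only
    the last pair seen.
    """
    D = []
    with open(filename) as f:
        for l in f:
            l = json.loads(l)
            arguments = {}
            for event in l['event_list']:
                for argument in event['arguments']:
                    key = argument['argument']
                    value = (event['event_type'], argument['role'])
                    arguments[key] = value
            D.append((l['text'], arguments))
    return D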


# Load the data
train_data = load_data('data_origin/train_data/train.json')
valid_data = load_data('data_origin/dev_data/dev.json')

In [4]:
train_data[:2]

[('雀巢裁员4000人：时代抛弃你时，连招呼都不会打！',
  {'雀巢': ('组织关系-裁员', '裁员方'), '4000人': ('组织关系-裁员', '裁员人数')}),
 ('美国“未来为”子公司大幅度裁员，这是为什么呢？任正非正式回应', {'美国“未来为”子公司': ('组织关系-裁员', '裁员方')})]

In [5]:
# Load the schema
with open('data_origin/event_schema/event_schema.json') as f:
    id2label, label2id, n = {}, {}, 0
    for l in f:
        l = json.loads(l)
        for role in l['role_list']:
            key = (l['event_type'], role['role'])
            id2label[n] = key
            label2id[key] = n
            n += 1
    num_labels = len(id2label) * 2 + 1
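
The cell above gives every (event type, role) pair an integer `k` and reserves two tags for it, which is why `num_labels` is twice the number of pairs plus one: tag `0` means "outside", `2k + 1` marks the first token of an argument, and `2k + 2` marks its continuation. The sketch below illustrates that encoding with a hypothetical two-entry schema; `encode_tag` and `decode_tag` are names introduced here and do not appear in the notebook.

```python
# Hypothetical two-entry schema; the real mapping is built from event_schema.json.
id2label = {
    0: ("组织关系-裁员", "裁员方"),
    1: ("组织关系-裁员", "裁员人数"),
}
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label) * 2 + 1  # one B tag and one I tag per pair, plus O


def encode_tag(key, begin):
    """Map an (event_type, role) pair to its B tag (odd) or I tag (even)."""
    return label2id[key] * 2 + (1 if begin else 2)


def decode_tag(tag):
    """Inverse of encode_tag; returns None for the O tag (0)."""
    if tag == 0:
        return None
    return id2label[(tag - 1) // 2], "B" if tag % 2 == 1 else "I"


print(num_labels, encode_tag(("组织关系-裁员", "裁员人数"), True), decode_tag(4))
```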

In [6]:
# Build the tokenizer
tokenizer = Tokenizer(dict_path, do_lower_case=True)

In [7]:
def search(pattern, sequence):
    """Search sequence for the sub-sequence pattern.
    Return the index of the first match, or -1 if there is none.
    """
    n = len(pattern)
    for i in range(len(sequence)):
        if sequence[i:i + n] == pattern:
            return i
    return -1


class data_generator(DataGenerator):
    """Data generator
    """
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text, arguments) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text, max_length=maxlen)
            labels = [0] * len(token_ids)
            for argument in arguments.items():
                a_token_ids = tokenizer.encode(argument[0])[0][1:-1]
                start_index = search(a_token_ids, token_ids)
                if start_index != -1:
                    labels[start_index] = label2id[argument[1]] * 2 + 1
                    for i in range(1, len(a_token_ids)):
                        labels[start_index + i] = label2id[argument[1]] * 2 + 2
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append(labels)
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []
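
The labelling step inside the generator can be isolated as follows. This is a sketch with made-up token ids rather than real tokenizer output; `build_labels` is a name introduced here, and `k` stands for the integer id that `label2id` assigns to an (event type, role) pair.

```python
def search(pattern, sequence):
    """Return the first index where `pattern` occurs in `sequence`, else -1."""
    n = len(pattern)
    for i in range(len(sequence)):
        if sequence[i:i + n] == pattern:
            return i
    return -1


def build_labels(token_ids, arguments):
    """Tag each argument's first occurrence: 2k+1 on its first token, 2k+2 after.

    `arguments` is a list of (argument_token_ids, k) pairs.
    """
    labels = [0] * len(token_ids)
    for a_token_ids, k in arguments:
        start = search(a_token_ids, token_ids)
        if start != -1:
            labels[start] = k * 2 + 1
            for i in range(1, len(a_token_ids)):
                labels[start + i] = k * 2 + 2
    return labels


# Made-up ids: 101/102 stand in for [CLS]/[SEP], the rest for characters.
token_ids = [101, 11, 12, 13, 14, 15, 16, 102]
labels = build_labels(token_ids, [([11, 12], 0), ([15, 16], 1), ([99], 2)])
print(labels)  # [0, 1, 2, 0, 0, 3, 4, 0]
```

Two consequences follow from this design: only the first occurrence of an argument is tagged, and an argument that cannot be found (for example because truncation to `maxlen` cut it off) is silently skipped, as the `[99]` entry shows.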

In [8]:
def viterbi_decode(nodes, trans):
    """Find the best label path with the Viterbi algorithm.
    nodes.shape=[seq_len, num_labels],
    trans.shape=[num_labels, num_labels].
    """
    labels = np.arange(num_labels).reshape((1, -1))
    scores = nodes[0].reshape((-1, 1))
    scores[1:] -= np.inf  # the first label must be 0
    paths = labels
    for l in range(1, len(nodes)):
        M = scores + trans + nodes[l].reshape((1, -1))
        idxs = M.argmax(0)
        scores = M.max(0).reshape((-1, 1))
        paths = np.concatenate([paths[:, idxs], labels], 0)
    return paths[:, scores[:, 0].argmax()]


def extract_arguments(text):
    """Extract arguments from a sentence.
    """
    tokens = tokenizer.tokenize(text)
    while len(tokens) > 510:
        tokens.pop(-2)
    mapping = tokenizer.rematch(text, tokens)
    token_ids = tokenizer.tokens_to_ids(tokens)
    segment_ids = [0] * len(token_ids)
    nodes = model.predict([[token_ids], [segment_ids]])[0]
    trans = K.eval(CRF.trans)
    labels = viterbi_decode(nodes, trans)
    arguments, starting = [], False
    for i, label in enumerate(labels):
        if label > 0:
            if label % 2 == 1:
                starting = True
                arguments.append([[i], id2label[(label - 1) // 2]])
            elif starting:
                arguments[-1][0].append(i)
            else:
                starting = False
        else:
            starting = False

    return {
        text[mapping[w[0]][0]:mapping[w[-1]][-1] + 1]: l
        for w, l in arguments
    }


def evaluate(data):
    """Evaluation function (may not match the official scorer exactly, but is close).
    """
    X, Y, Z = 1e-10, 1e-10, 1e-10
    for text, arguments in tqdm(data):
        inv_arguments = {v: k for k, v in arguments.items()}
        pred_arguments = extract_arguments(text)
        pred_inv_arguments = {v: k for k, v in pred_arguments.items()}
        Y += len(pred_inv_arguments)
        Z += len(inv_arguments)
        for k, v in pred_inv_arguments.items():
            if k in inv_arguments:
                # score the match by the LCS length (pylcs.lcs: longest common subsequence)
                l = pylcs.lcs(v, inv_arguments[k])
                X += 2. * l / (len(v) + len(inv_arguments[k]))
    f1, precision, recall = 2 * X / (Y + Z), X / Y, X / Z
    return f1, precision, recall


def predict_to_file(in_file, out_file):
    """Write predictions to a file in the submission format.
    """
    fw = open(out_file, 'w', encoding='utf-8')
    with open(in_file) as fr:
        for l in tqdm(fr):
            l = json.loads(l)
            arguments = extract_arguments(l['text'])
            event_list = []
            for k, v in arguments.items():
                event_list.append({
                    'event_type': v[0],
                    'arguments': [{
                        'role': v[1],
                        'argument': k
                    }]
                })
            l['event_list'] = event_list
            l = json.dumps(l, ensure_ascii=False)
            fw.write(l + '\n')
    fw.close()


class Evaluator(keras.callbacks.Callback):
    """Evaluate on the validation set and save the best model.
    """
    def __init__(self):
        self.best_val_f1 = 0.

    def on_epoch_end(self, epoch, logs=None):
        f1, precision, recall = evaluate(valid_data)
        if f1 >= self.best_val_f1:
            self.best_val_f1 = f1
            model.save_weights('best_model.weights')
        print(
            'f1: %.5f, precision: %.5f, recall: %.5f, best f1: %.5f\n' %
            (f1, precision, recall, self.best_val_f1)
        )
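
To make the decoding step easier to trust, here is a self-contained sketch of the same Viterbi recursion, checked against exhaustive search on a tiny random problem. It differs from the cell above in two labelled ways: `num_labels` is read from `nodes` instead of a global, and the scores are copied so the input array is not modified. `brute_force` is a helper introduced here for the comparison.

```python
import itertools
import numpy as np


def viterbi_decode(nodes, trans):
    """Best label path for emissions nodes[seq_len, n] and transitions trans[n, n].

    The first position is forced to label 0, as in the notebook's version.
    """
    num_labels = nodes.shape[1]
    labels = np.arange(num_labels).reshape((1, -1))
    scores = nodes[0].reshape((-1, 1)).astype(float)  # copy; do not mutate input
    scores[1:] = -np.inf
    paths = labels
    for t in range(1, len(nodes)):
        # M[i, j]: best score of a path ending in i at t-1, then taking j at t
        M = scores + trans + nodes[t].reshape((1, -1))
        idxs = M.argmax(0)
        scores = M.max(0).reshape((-1, 1))
        paths = np.concatenate([paths[:, idxs], labels], 0)
    return paths[:, scores[:, 0].argmax()]


def brute_force(nodes, trans):
    """Score every path that starts at label 0 and return the best one."""
    seq_len, n = nodes.shape
    best, best_score = None, -np.inf
    for rest in itertools.product(range(n), repeat=seq_len - 1):
        path = (0,) + rest
        score = sum(nodes[t, y] for t, y in enumerate(path))
        score += sum(trans[a, b] for a, b in zip(path, path[1:]))
        if score > best_score:
            best, best_score = path, score
    return np.array(best)


rng = np.random.default_rng(0)
nodes, trans = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
print(viterbi_decode(nodes, trans), brute_force(nodes, trans))
```

Exhaustive search grows exponentially with sequence length, so it is only useful as a test oracle on toy inputs like this one.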

In [9]:
# Basic settings
maxlen = 128
epochs = 5
batch_size = 32
learning_rate = 2e-5
crf_lr_multiplier = 100  # enlarge the CRF layer's learning rate if needed

In [10]:
model = build_transformer_model(
    config_path,
    checkpoint_path,
)

output = Dense(num_labels)(model.output)
CRF = ConditionalRandomField(lr_multiplier=crf_lr_multiplier)
output = CRF(output)

model = Model(model.input, output)
model.summary()

model.compile(
    loss=CRF.sparse_loss,
    optimizer=Adam(learning_rate),
    metrics=[CRF.sparse_accuracy]
)


Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Embedding-Token (Embedding)     (None, None, 768)    16226304    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, None, 768)    1536        Input-Segment[0][0]              
____________________________________________________________________________________________

In [11]:
train_generator = data_generator(train_data, batch_size)
evaluator = Evaluator()
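
The `Evaluator` callback scores each epoch with `evaluate`, which gives partial credit when a predicted argument overlaps the gold argument for the same (event type, role) pair. The sketch below reproduces that scoring without a model. It assumes `pylcs.lcs` returns the longest common subsequence length and replaces it with a plain dynamic-programming helper; `lcs_length` and `soft_f1` are names introduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b (simple DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def soft_f1(gold_list, pred_list):
    """Each list holds one {argument_text: (event_type, role)} dict per sentence."""
    X, Y, Z = 1e-10, 1e-10, 1e-10  # matched credit, predicted count, gold count
    for gold, pred in zip(gold_list, pred_list):
        inv_gold = {v: k for k, v in gold.items()}
        inv_pred = {v: k for k, v in pred.items()}
        Y += len(inv_pred)
        Z += len(inv_gold)
        for key, p_text in inv_pred.items():
            if key in inv_gold:
                g_text = inv_gold[key]
                # Credit between 0 and 1, proportional to the character overlap
                X += 2.0 * lcs_length(p_text, g_text) / (len(p_text) + len(g_text))
    return 2 * X / (Y + Z), X / Y, X / Z


gold = [{"雀巢": ("组织关系-裁员", "裁员方"), "4000人": ("组织关系-裁员", "裁员人数")}]
pred = [{"雀巢": ("组织关系-裁员", "裁员方"), "4000": ("组织关系-裁员", "裁员人数")}]
f1, precision, recall = soft_f1(gold, pred)
print(round(f1, 4), round(precision, 4), round(recall, 4))
```

In this toy case the prediction `4000` earns 8/9 of a match against the gold `4000人` instead of zero, which is the point of the soft metric.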

In [12]:
"""
Epoch 7/20
374/374 [==============================] - 10741s 29s/step - loss: 2.7422 - sparse_accuracy: 0.9492

100%|██████████| 1498/1498 [05:01<00:00,  4.98it/s]

f1: 0.77810, precision: 0.77810, recall: 0.77810, best f1: 0.77810
"""
model.load_weights('best_model.weights')
model.fit_generator(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    callbacks=[evaluator]
)


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/5


100%|██████████| 1498/1498 [04:26<00:00,  5.63it/s]


f1: 0.77174, precision: 0.76939, recall: 0.77411, best f1: 0.77174

Epoch 2/5
  1/374 [..............................] - ETA: 5:01:59 - loss: 1.7293 - sparse_accuracy: 0.9605

KeyboardInterrupt: 

In [13]:

predict_to_file('data_origin/test1_data/test1.json', 'ee_pred.json')

1489it [04:44,  5.24it/s]
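
For reference, each line that `predict_to_file` writes has the shape sketched below: every predicted argument becomes its own single-argument event, with no trigger and no grouping of arguments by event. The record, its `id`, and the helper name `to_submission_line` are hypothetical.

```python
import json


def to_submission_line(record, arguments):
    """Build one output line; `arguments` maps argument text to (event_type, role)."""
    out = dict(record)
    out["event_list"] = [
        {"event_type": event_type, "arguments": [{"role": role, "argument": text}]}
        for text, (event_type, role) in arguments.items()
    ]
    # ensure_ascii=False keeps Chinese characters readable in the output file
    return json.dumps(out, ensure_ascii=False)


# Hypothetical record and prediction, shaped like the data shown earlier.
record = {"text": "雀巢裁员4000人：时代抛弃你时，连招呼都不会打！", "id": "example-id"}
arguments = {"雀巢": ("组织关系-裁员", "裁员方"), "4000人": ("组织关系-裁员", "裁员人数")}
line = to_submission_line(record, arguments)
print(line)
```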
