# Chinese Opinion-Target Extraction for Sentiment Analysis

## 1 Introduction to the Chinese opinion-extraction task

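Opinion extraction: given a text `d`, the system must output the opinion target `a` that the text describes, where `a` always appears in `d` (that is, `a` is a span of `d`). Each sample in the dataset is a pair `<d, a>`, for example: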
```
Input text (d): 重庆老灶火锅还是很赞的，有机会可以尝试一下！
Opinion target (a): 重庆老灶火锅
```

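## 2 Implementing Chinese opinion extraction with an extractive MRC framework

Opinion extraction is conventionally treated as a sequence-labelling problem.

The main idea here is to recast that sequence-labelling task as an extractive machine reading comprehension (MRC) task.

The extractive MRC architecture, based on the `Pre-training + Fine-tuning` paradigm, is shown below. The pre-trained model used here is `ernie-gram-zh`: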

![](https://ai-studio-static-online.cdn.bcebos.com/a479d066c6e340f2a0a28ab580dcc73393e072e51e1b462aae3aeb9cedb62c51)
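
To make the span-prediction idea concrete, here is a minimal sketch of how an answer can be decoded from start and end scores. It is not the `predict` helper used later in this notebook. The function name `best_span`, the `max_answer_len` limit, and the toy logits are assumptions for illustration, and the sketch assumes one score per character of the context, whereas a real model also scores the question and special tokens.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed score, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider spans of at most max_answer_len positions.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


context = "重庆老灶火锅还是很赞的，有机会可以尝试一下！"
# Toy logits, one per character (hypothetical values, not model output).
start_logits = [0.0] * len(context)
end_logits = [0.0] * len(context)
start_logits[0] = 5.0  # the span most likely starts at "重"
end_logits[5] = 4.0    # and most likely ends at "锅"

start, end = best_span(start_logits, end_logits)
answer = context[start:end + 1]
print(start, end, answer)
```

The exhaustive search is quadratic in the sequence length, which is usually acceptable for sequences of about 128 tokens. Production decoders typically restrict the search to the top-scoring start and end candidates.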


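### 2.1 Data processing

The key step is converting the opinion-extraction dataset into a `SQuAD`-compatible format. Note that "`SQuAD`-compatible" here does not mean the original `SQuAD` `json` file format, but the common input sample format shared by extractive MRC models.

Taking `COTE-DP` as an example, the opinion-extraction dataset bundled with `PaddleNLP` has the following format: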
```
{
    'tokens': ['重', '庆', '老', '灶', '火', '锅', '还', '是', '很', '赞', '的', '，', '有', '机', '会', '可', '以', '尝', '试', '一', '下', '！'],
    'labels': [0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'entity': '重庆老灶火锅'
}
```
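
The sample above suggests how the sequence-labelling view encodes the target. Judging only from this one sample, label `0` appears to mark the first character of the entity, `1` the following characters, and `2` everything else. That reading is an assumption, not something the dataset documentation is quoted on here. Under that assumption, the entity can be recovered from the labels with a short sketch (the helper name `entities_from_labels` is hypothetical):

```python
from typing import List


def entities_from_labels(tokens: List[str], labels: List[int],
                         begin_id: int = 0, inside_id: int = 1) -> List[str]:
    """Join tokens into entities, assuming begin/inside/outside label ids."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == begin_id:
            # A new entity starts; flush any entity still being built.
            if current:
                entities.append("".join(current))
            current = [token]
        elif label == inside_id and current:
            current.append(token)
        else:
            # Outside label (or a stray inside label): close the open entity.
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


tokens = list("重庆老灶火锅还是很赞的，有机会可以尝试一下！")
labels = [0, 1, 1, 1, 1, 1] + [2] * 16  # same labels as the sample above
print(entities_from_labels(tokens, labels))
```

For this sample the recovered span matches the `entity` field, which is why the MRC conversion can use `entity` directly as the answer text.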
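After conversion to the `SQuAD`-compatible format, that is, the input sample format of an extractive `MRC` model, the same sample looks like this: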
```
{
    'id': 'qid0',
    'title': '',
    'context': '重庆老灶火锅还是很赞的，有机会可以尝试一下！',
    'question': '评价对象',
    'answers': ['重庆老灶火锅'],
    'answer_starts': [0]
}
```


In [6]:
# Upgrade PaddleNLP to the latest version
!pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple/

In [2]:
from paddlenlp.datasets import load_dataset
from paddlenlp.datasets import MapDataset

In [3]:
def create_dataset(data_name='dp', split='train'):
    """Create a dataset from the data_name and split arguments.

    Args:
        data_name: str, one of 'dp', 'bd', 'mfw'
        split: str, 'train' or 'test'
    """

    # COTE only provides a training set and a test set, so split must be 'train' or 'test'
    assert isinstance(split, str), 'split must be str, it could be "train" or "test".'

    if split == 'train':
        is_test = False
    elif split == 'test':
        is_test = True
    else:
        raise ValueError('split must be "train" or "test".')

    # Create the dataset from data_name and split
    dataset = load_dataset('cote', data_name, splits=[split], lazy=False)

    # Next, convert the dataset into the SQuAD-compatible format
    examples = []
    for idx, example in enumerate(dataset):
        qid = 'qid' + str(idx)
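        # tokens correspond to the MRC context
        context = ''.join(example['tokens'])
        # Many raw samples start with a space or NBSP character. Pointer-based
        # methods are position-sensitive, so the leading whitespace must be removed
        # (strip() also removes trailing whitespace).
        context = context.strip()

        # The original dataset has no question, so we have to define one ourselves.
        # For opinion extraction the question could be
        # '这句话的评价对象是什么？' ("What is the opinion target of this sentence?").
        # Here I simply use '评价对象' ("opinion target").
        # I have not studied in depth how the choice of question affects performance.
        # If you are curious, try setting question to something unrelated,
        # for example '你吃过了吗？' ("Have you eaten?").
        question = '评价对象'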
        if not is_test:  # training set
            answer = example['entity']

            # Skip samples whose answer does not occur in the context
            answer_start = context.find(answer)
            if answer_start < 0:
                continue

            new_example = {
                'id': qid,
                'title': '',
                'context': context,
                'question': question,
                'answers': [answer],
                'answer_starts': [answer_start]
            }
        else:  # test set
            new_example = {
                'id': qid,
                'title':'',
                'context': context,
                'question': question,
                'answers': [],
                'answer_starts': []
            }

        examples.append(new_example)
    
    # Build a MapDataset object from the list of examples
    dataset = MapDataset(examples)

    # Return the dataset
    return dataset
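
Because pointer-based models are position-sensitive, it can be worth checking that every `answer_starts` offset really points at its answer inside `context`. The following is a minimal sketch of such a check on plain dictionaries. The helper name `is_aligned` is hypothetical and not part of `PaddleNLP`.

```python
from typing import Any, Dict


def is_aligned(example: Dict[str, Any]) -> bool:
    """Check that each answer occurs in the context at its recorded offset."""
    context = example["context"]
    answers = example["answers"]
    starts = example["answer_starts"]
    if len(answers) != len(starts):
        return False
    return all(
        start >= 0 and context[start:start + len(answer)] == answer
        for answer, start in zip(answers, starts)
    )


good = {
    "id": "qid0",
    "title": "",
    "context": "重庆老灶火锅还是很赞的，有机会可以尝试一下！",
    "question": "评价对象",
    "answers": ["重庆老灶火锅"],
    "answer_starts": [0],
}
bad = dict(good, answer_starts=[1])  # offset shifted by one character

print(is_aligned(good), is_aligned(bad))
```

Test-set examples, which carry empty `answers` and `answer_starts` lists, pass this check trivially.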

Let's check whether the dataset we created has the same format as an extractive `MRC` dataset (for example, Dureader-robust).

In [4]:
train_cote_dp = create_dataset('dp', split='train')
train_robust = load_dataset('dureader_robust', splits='train')

In [5]:
train_cote_dp[0]

In [6]:
train_robust[0]

### 2.2 Model training and evaluation

In [None]:
import json
import math
import os
import random
import time
from functools import partial

import numpy as np
import paddle
from paddle.io import DataLoader
from paddle.io import BatchSampler
from paddle.io import DistributedBatchSampler
from paddlenlp.data import Dict
from paddlenlp.data import Pad
from paddlenlp.data import Stack
from paddlenlp.data import Tuple
from paddlenlp.datasets import load_dataset
from paddlenlp.datasets import MapDataset
from paddlenlp.ops.optimizer import AdamW
from paddlenlp.transformers import BertForQuestionAnswering
from paddlenlp.transformers import BertTokenizer
from paddlenlp.transformers import ErnieForQuestionAnswering
from paddlenlp.transformers import ErnieTokenizer
from paddlenlp.transformers import ErnieGramForQuestionAnswering
from paddlenlp.transformers import ErnieGramTokenizer
from paddlenlp.transformers import RobertaForQuestionAnswering
from paddlenlp.transformers import RobertaTokenizer
from paddlenlp.transformers import LinearDecayWithWarmup

from sklearn.model_selection import train_test_split

from config import Config
from dataset import create_dataset
from utils import CrossEntropyLossForSQuAD
from utils import evaluate
from utils import predict
from utils import prepare_train_features
from utils import prepare_validation_features
from utils import set_seed

In [None]:
MODEL_CLASSES = {
    "bert": (BertForQuestionAnswering, BertTokenizer),
    "ernie": (ErnieForQuestionAnswering, ErnieTokenizer),
    "ernie_gram": (ErnieGramForQuestionAnswering, ErnieGramTokenizer),
    "roberta": (RobertaForQuestionAnswering, RobertaTokenizer)
}

#### 2.2.1 Wrapping the training and evaluation code

In [None]:
def do_train(args):
    
    paddle.set_device(args.device)
    set_seed(args)

    args.model_type = args.model_type.lower()
    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)

    train_ds = create_dataset(data_name=args.data_name, split='train')
    train_ds, dev_ds = train_test_split(train_ds, test_size=0.3, random_state=args.seed)
    train_ds, dev_ds = MapDataset(train_ds), MapDataset(dev_ds)

    train_trans_func = partial(
        prepare_train_features, 
        max_seq_length=args.max_seq_length, 
        doc_stride=args.doc_stride,
        tokenizer=tokenizer
    )

    train_ds.map(train_trans_func, batched=True)

    dev_trans_func = partial(
        prepare_validation_features, 
        max_seq_length=args.max_seq_length, 
        doc_stride=args.doc_stride,
        tokenizer=tokenizer
    )

    dev_ds.map(dev_trans_func, batched=True)

    # Define the batch samplers
    train_batch_sampler = DistributedBatchSampler(
            dataset=train_ds, 
            batch_size=args.batch_size, 
            shuffle=True
    )
    dev_batch_sampler = BatchSampler(
        dataset=dev_ds, 
        batch_size=args.batch_size, 
        shuffle=False
    )
    # Define the batchify functions
    train_batchify_fn = lambda samples, fn=Dict({
        "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
        "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
        "start_positions": Stack(dtype="int64"),
        "end_positions": Stack(dtype="int64")
    }): fn(samples)

    dev_batchify_fn = lambda samples, fn=Dict({
        "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
        "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
    }): fn(samples)

    # Build the DataLoaders
    train_data_loader = DataLoader(
        dataset=train_ds,
        batch_sampler=train_batch_sampler,
        collate_fn=train_batchify_fn,
        return_list=True
    )

    dev_data_loader =  DataLoader(
        dataset=dev_ds,
        batch_sampler=dev_batch_sampler,
        collate_fn=dev_batchify_fn,
        return_list=True
    )

    output_dir = os.path.join(args.output_dir, 'best_model')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    model = model_class.from_pretrained(args.model_name_or_path)
    # model = model_class.from_pretrained(output_dir)


    num_training_steps = args.max_steps if args.max_steps > 0 else len(
        train_data_loader) * args.num_train_epochs
    num_train_epochs = math.ceil(num_training_steps / len(train_data_loader))

    num_batches = len(train_data_loader)

    lr_scheduler = LinearDecayWithWarmup(
        learning_rate=args.learning_rate, 
        total_steps=num_training_steps,
        warmup=args.warmup_proportion
    )

    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=args.adam_epsilon,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params
    )

    criterion = CrossEntropyLossForSQuAD()

    best_val_f1 = 0.0

    global_step = 0
    tic_train = time.time()
    for epoch in range(1, num_train_epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):

            global_step += 1
            
            input_ids, segment_ids, start_positions, end_positions = batch
            logits = model(input_ids=input_ids, token_type_ids=segment_ids)
            loss = criterion(logits, (start_positions, end_positions))

            if global_step % args.log_steps == 0:
                # print("global step %d, epoch: %d, batch: %d/%d, loss: %.5f,  speed: %.2f step/s" % (
                #     global_step, epoch, step, num_batches, loss, args.log_steps / (time.time() - tic_train)))
                
                print("global step %d, epoch: %d, batch: %d/%d, loss: %.5f,  speed: %.2f step/s, lr: %1.16e" % (
                    global_step, epoch, step, num_batches, loss, args.log_steps / (time.time() - tic_train), lr_scheduler.get_lr()))
                
                tic_train = time.time()
            
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

            if global_step % args.save_steps == 0 or global_step == num_training_steps:
                em, f1 = evaluate(model=model, data_loader=dev_data_loader)

                print("global step: %d, eval dev Exact Match: %.5f, f1_score: %.5f" % (global_step, em, f1))

                if f1 > best_val_f1:
                    best_val_f1 = f1

                    print("save model at global step: %d, best eval f1_score: %.5f" % (global_step, best_val_f1))

                    model.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                if global_step == num_training_steps:
                    break
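
The training loop relies on `CrossEntropyLossForSQuAD` from the local `utils` module, whose source is not shown in this notebook. A common formulation for extractive MRC is the mean of two cross-entropy terms, one over the start position and one over the end position. The sketch below illustrates that formulation in plain Python for a single example. It is an assumption about what such a loss typically computes, not the actual `utils` implementation, and the function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-softmax of the target index (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits: List[float], end_logits: List[float],
              start_position: int, end_position: int) -> float:
    """Average the start-position and end-position cross-entropy terms."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Uninformative logits versus logits that favour the gold span (toy values).
uniform = span_loss([0.0] * 4, [0.0] * 4, start_position=1, end_position=2)
confident = span_loss([0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0], 1, 2)
print(uniform, confident)
```

With uniform logits over four positions the loss equals `log(4)`, and it shrinks toward zero as the model concentrates probability on the gold start and end positions.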


#### 2.2.2 Defining the training parameters

In [None]:
args = Config(model_type='ernie_gram', 
              model_name_or_path='ernie-gram-zh', 
              data_name='mfw',  # dp, bd, mfw
              output_dir='./outputs/cote/mfw',  # './outputs/cote/dp', './outputs/cote/bd', './outputs/cote/mfw'
              
              max_seq_length=128, 
              batch_size=32,
              learning_rate=5e-5,
              num_train_epochs=10,
              log_steps=20,          # dp, mfw == 20,  bd == 10
              save_steps=200,        # dp, mfw == 200, bd == 100
              doc_stride=64,
              warmup_proportion=0.1,
              weight_decay=0.01)
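
To see what `learning_rate=5e-5` and `warmup_proportion=0.1` imply, here is a plain-Python sketch of a linear warmup followed by a linear decay, the shape that `LinearDecayWithWarmup` is understood to produce. The function name, the exact rounding of the warmup step count, and the `total_steps=1000` value are assumptions for illustration. The real step count depends on the dataset size and `batch_size`.

```python
import math
from typing import Union


def linear_warmup_decay_lr(step: int, base_lr: float, total_steps: int,
                           warmup: Union[int, float]) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    # A float warmup is read as a proportion of total_steps (assumption).
    if isinstance(warmup, int):
        warmup_steps = warmup
    else:
        warmup_steps = int(math.floor(warmup * total_steps))
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


total_steps = 1000  # hypothetical; the notebook derives it from the data loader
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, 5e-5, total_steps, 0.1))
```

Under these assumptions the rate climbs from zero to the peak over the first 10% of the steps and then falls linearly back to zero at the final step.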

#### 2.2.3 Starting training

In [None]:
do_train(args)

### 2.3 Model prediction

#### 2.3.1 Wrapping the prediction code

In [None]:
def do_predict(args):

    paddle.set_device(args.device)

    output_dir = os.path.join(args.output_dir, "best_model")

    # 1. Load the test set
    test_ds = create_dataset(data_name=args.data_name, split='test')

    model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
    tokenizer = tokenizer_class.from_pretrained(output_dir)

    # 2. Convert the text to token ids
    test_trans_func = partial(
        prepare_validation_features, 
        max_seq_length=args.max_seq_length, 
        doc_stride=args.doc_stride,
        tokenizer=tokenizer
    )
    test_ds.map(test_trans_func, batched=True)

    # test BatchSampler
    test_batch_sampler = BatchSampler(
        dataset=test_ds, 
        batch_size=args.batch_size, 
        shuffle=False
    )

    # test dataset features batchify
    test_batchify_fn = lambda samples, fn=Dict({
        "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
        "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
    }): fn(samples)

    # test DataLoader
    test_data_loader =  DataLoader(
        dataset=test_ds,
        batch_sampler=test_batch_sampler,
        collate_fn=test_batchify_fn,
        return_list=True
    )

    model = model_class.from_pretrained(output_dir)
    
    all_predictions = predict(model, test_data_loader)

    # Can also write all_nbest_json and scores_diff_json files if needed
    with open('COTE_' + args.data_name.upper() + '.tsv', "w", encoding='utf-8') as writer:
        writer.write('index\tprediction\n')
        idx = 0
        for example in test_data_loader.dataset.data:
            writer.write(str(idx) + '\t' + all_predictions[example['id']] + '\n')
            idx += 1

    count = 0
    for example in test_data_loader.dataset.data:
        count += 1
        print()
        print('Question:', example['question'])
        print('Context:', ''.join(example['context']))
        print('Answer:', all_predictions[example['id']])
        if count >= 10:
            break

#### 2.3.2 Running prediction

In [None]:
do_predict(args)

### 2.4 Experimental results

||DP|BD|MFW|
|---|---|---|---|
|Official|0.8496|0.8649|0.8732|
|Ours|**0.913**|**0.8994**|**0.8907**|
|diff|**0.0634**|**0.0345**|**0.0175**|


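## 3 Feedback

If you have any questions or comments about this project, feel free to mention `@我爱志方小姐` in the `QQ` group of the NLP 打卡营 (NLP check-in camp) at any time. If you are not in the `QQ` group, you are also welcome to leave your suggestions in the comment section.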
