## Chinese Named Entity Recognition (NER)

Model: uer/roberta-base-finetuned-cluener2020-chinese

Framework: MindSpore + MindNLP
Environment: Orange Pi AIpro (20T) + Ubuntu + MindSpore 2.6.0 + MindNLP 0.4.1

Model loading: `from_pretrained` with the model ID (downloaded files are cached locally)

Task: Chinese named entity recognition, predicting the entity label of each token

In [1]:
import mindspore as ms

# Report the installed MindSpore version and the active device target
print(f"MindSpore version: {ms.__version__}")
print(f"Current device: {ms.get_context('device_target')}")

MindSpore version: 2.6.0
Current device: Ascend


Load the model

In [2]:
# Load the model by its ID on the Hugging Face Hub
from mindnlp.transformers import BertTokenizer, BertForTokenClassification

# Model ID on the Hugging Face Hub
model_name = "uer/roberta-base-finetuned-cluener2020-chinese"

# Load the tokenizer and model from the Hugging Face Hub over the network
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name, from_pt=True)



Define the label mapping

In [3]:
id2label = {
    0: 'O',
    1: 'B-address', 2: 'I-address',
    3: 'B-book', 4: 'I-book',
    5: 'B-company', 6: 'I-company',
    7: 'B-game', 8: 'I-game',
    9: 'B-government', 10: 'I-government',
    11: 'B-movie', 12: 'I-movie',
    13: 'B-name', 14: 'I-name',
    15: 'B-organization', 16: 'I-organization',
    17: 'B-position', 18: 'I-position',
    19: 'B-scene', 20: 'I-scene'
}
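
The mapping above follows a regular pattern: id 0 is `O`, and each entity type then contributes a `B-` and an `I-` tag, with the types in alphabetical order. As a minimal sketch, the same dictionary could be generated from a list of type names instead of being typed by hand. `ENTITY_TYPES`, `build_id2label`, and `label2id` are names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper: rebuild the BIO label map from the entity type names.
ENTITY_TYPES = [
    "address", "book", "company", "game", "government",
    "movie", "name", "organization", "position", "scene",
]

def build_id2label(entity_types):
    """Return {id: label} with 0 -> 'O' and a B-/I- pair per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return dict(enumerate(labels))

generated = build_id2label(ENTITY_TYPES)

# The reverse map is handy when converting label names back to ids.
label2id = {label: idx for idx, label in generated.items()}
```

Generating the map this way keeps the ids and the tag names in step if the type list is ever edited, at the cost of hiding the explicit table.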

Inference

In [4]:
import numpy as np

def predict(text):
    """Tag each token of `text` with its predicted entity label."""
    # Tokenize, pad or truncate to 128 tokens, and return MindSpore tensors
    encoded = tokenizer(text, return_tensors='ms', padding='max_length', truncation=True, max_length=128)
    input_ids = encoded['input_ids']
    attention_mask = encoded['attention_mask']

    # Forward pass, then take the highest-scoring label id for each token
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits.asnumpy()
    preds = np.argmax(logits, axis=-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].asnumpy().tolist())

    results = []
    for token, label_id in zip(tokens, preds):
        # Skip the special tokens added by the tokenizer
        if token in ['[PAD]', '[CLS]', '[SEP]']:
            continue
        # Cast the NumPy integer to int; ids missing from the map fall back to 'O'
        label = id2label.get(int(label_id), 'O')
        results.append({'token': token, 'entity': label})
    return results
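
`predict` keeps only the arg-max label for each token and discards the scores. If a confidence value is also wanted, one option is to apply a softmax over the label scores of each token. The sketch below does this in plain Python on made-up scores. `softmax`, `top_label`, and `toy_id2label` are illustrative names, and in the notebook the input would presumably be one row of the logits array converted to a list.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one token's label scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(scores, id2label):
    """Return (label, probability) for the highest-scoring label id."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label.get(best, "O"), probs[best]

# Made-up scores for a three-label toy problem, for illustration only
toy_id2label = {0: "O", 1: "B-name", 2: "I-name"}
label, prob = top_label([0.5, 3.0, 1.0], toy_id2label)
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow when scores are large.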

In [5]:
sample_text = '马化腾是腾讯公司的创始人之一，出生于广东汕头'
output = predict(sample_text)
for item in output:
    print(f"{item['token']} → {item['entity']}")

马 → B-name
化 → I-name
腾 → I-name
是 → O
腾 → B-company
讯 → I-company
公 → I-company
司 → I-company
的 → O
创 → B-position
始 → I-position
人 → I-position
之 → O
一 → O
， → O
出 → O
生 → O
于 → O
广 → B-address
东 → I-address
汕 → I-address
头 → I-address
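
The per-token tags above can be merged into whole entities by joining each `B-` token with the `I-` tokens of the same type that follow it. The sketch below is one way to do that, assuming every token is a single Chinese character so that plain concatenation restores the text. `group_entities` is a hypothetical helper, not part of the original notebook.

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens into (text, type) pairs."""
    entities = []
    current_text, current_type = "", None
    for item in tagged_tokens:
        token, tag = item["token"], item["entity"]
        if tag.startswith("B-"):
            # A new entity starts; close any span that is still open
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = token, tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the open span of the same type
            current_text += token
        else:
            # 'O', or an I- tag that does not continue the open span
            if current_type is not None:
                entities.append((current_text, current_type))
            current_text, current_type = "", None
    if current_type is not None:
        entities.append((current_text, current_type))
    return entities

# The first eight tagged tokens from the output above
example = [
    {"token": "马", "entity": "B-name"},
    {"token": "化", "entity": "I-name"},
    {"token": "腾", "entity": "I-name"},
    {"token": "是", "entity": "O"},
    {"token": "腾", "entity": "B-company"},
    {"token": "讯", "entity": "I-company"},
    {"token": "公", "entity": "I-company"},
    {"token": "司", "entity": "I-company"},
]
entities = group_entities(example)
print(entities)  # [('马化腾', 'name'), ('腾讯公司', 'company')]
```

An `I-` tag that does not match the open span is dropped here rather than started as a new entity; other conventions are possible.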
