## 基于Bert进行实体识别任务微调

致Great，ChallengeHub公众号，微信：1185918903，备注NLP技术交流

知乎：https://www.zhihu.com/people/483684d821a67a8d43ef449ae607ad6b

和鲸主页：https://www.heywhale.com/home/user/profile/58f387e7a686fb29e425d133

和鲸训练营-零基础入门实体识别：https://www.heywhale.com/home/activity/detail/6216f74572960d0017d5e691

#### **所需要的pip包**

* pandas
* numpy
* sklearn
* pytorch
* transformers：
    https://github.com/huggingface/transformers
    
    https://huggingface.co/models
* seqeval



In [1]:
#!pip install transformers seqeval[gpu]

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertConfig, BertForTokenClassification

In [2]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


#### **数据处理**


比赛数据下载地址：商品标题实体识别
https://www.heywhale.com/home/competition/620b34ed28270b0017b823ad

In [3]:
pd.DataFrame([[1,2,3],
             [4,5,6]])

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6


In [4]:
# with open('data/05/train_data/train.txt','r',encoding='utf-8') as f:
with open('/home/mw/input/task056960/train.txt','r',encoding='utf-8') as f:

    tmp=[]
    cnt=1
    for line in tqdm(f.read().split('\n')):
        sentence_id=f'train_{cnt}'
        # print(line)
        if line!='\n' and len(line.strip())>0:
            
            word_tags=line.split(' ')
            # print(word_tags,len(word_tags))
            if len(word_tags)==2:
                tmp.append([sentence_id]+word_tags)
            elif len(word_tags)==3: # ['', '', 'O'] 3 空格
                # word=' '.join(word_tags[:-1])
                word='[SEP]'
                tag=word_tags[-1]
                tmp.append([sentence_id,word,tag])
        else:
            cnt+=1

100%|████████████████████████████████████████████████████████████████████| 2288791/2288791 [00:05<00:00, 401430.65it/s]


In [5]:
tmp[0]

['train_1', '手', 'B-40']

In [6]:
data=pd.DataFrame(tmp,columns=['sentence_id','words','tags'])
data.head()

Unnamed: 0,sentence_id,words,tags
0,train_1,手,B-40
1,train_1,机,I-40
2,train_1,三,B-4
3,train_1,脚,I-4
4,train_1,架,I-4


- 验证空格
```
外 I-7
支 B-4
撑 I-4
架 I-4
  O
【 O
女 B-16
```

In [7]:
data[data['sentence_id']=='train_1']

Unnamed: 0,sentence_id,words,tags
0,train_1,手,B-40
1,train_1,机,I-40
2,train_1,三,B-4
3,train_1,脚,I-4
4,train_1,架,I-4
...,...,...,...
61,train_1,+,O
62,train_1,蓝,B-11
63,train_1,牙,I-11
64,train_1,遥,B-11


In [9]:
data.iloc[50]

sentence_id    train_1
words            [SEP]
tags                 O
Name: 50, dtype: object

- 转为句子

In [8]:
data['sentence'] = data[['sentence_id','words','tags']].groupby(['sentence_id'])['words'].transform(lambda x: ' '.join(x))
data['word_labels'] = data[['sentence_id','words','tags']].groupby(['sentence_id'])['tags'].transform(lambda x: ','.join(x))
data.head()

Unnamed: 0,sentence_id,words,tags,sentence,word_labels
0,train_1,手,B-40,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."
1,train_1,机,I-40,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."
2,train_1,三,B-4,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."
3,train_1,脚,I-4,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."
4,train_1,架,I-4,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."


In [9]:
data.shape

(2248790, 5)

In [10]:
data['sentence_id'].nunique()

40000

In [11]:
labels_to_ids = {k: v for v, k in enumerate(data.tags.unique())}
ids_to_labels = {v: k for v, k in enumerate(data.tags.unique())}
labels_to_ids

{'B-40': 0,
 'I-40': 1,
 'B-4': 2,
 'I-4': 3,
 'B-14': 4,
 'I-14': 5,
 'B-5': 6,
 'I-5': 7,
 'B-7': 8,
 'I-7': 9,
 'B-11': 10,
 'I-11': 11,
 'B-13': 12,
 'I-13': 13,
 'B-8': 14,
 'I-8': 15,
 'O': 16,
 'B-16': 17,
 'I-16': 18,
 'B-29': 19,
 'I-29': 20,
 'B-9': 21,
 'I-9': 22,
 'B-12': 23,
 'I-12': 24,
 'B-18': 25,
 'I-18': 26,
 'B-1': 27,
 'I-1': 28,
 'B-3': 29,
 'I-3': 30,
 'B-22': 31,
 'I-22': 32,
 'B-37': 33,
 'I-37': 34,
 'B-39': 35,
 'I-39': 36,
 'B-10': 37,
 'I-10': 38,
 'B-36': 39,
 'I-36': 40,
 'B-34': 41,
 'I-34': 42,
 'B-31': 43,
 'I-31': 44,
 'B-38': 45,
 'I-38': 46,
 'B-54': 47,
 'I-54': 48,
 'B-6': 49,
 'I-6': 50,
 'B-30': 51,
 'I-30': 52,
 'B-15': 53,
 'I-15': 54,
 'B-2': 55,
 'I-2': 56,
 'B-49': 57,
 'I-49': 58,
 'B-21': 59,
 'I-21': 60,
 'B-47': 61,
 'I-47': 62,
 'B-23': 63,
 'I-23': 64,
 'B-20': 65,
 'I-20': 66,
 'B-50': 67,
 'I-50': 68,
 'B-46': 69,
 'I-46': 70,
 'B-41': 71,
 'I-41': 72,
 'B-43': 73,
 'I-43': 74,
 'B-48': 75,
 'I-48': 76,
 'B-19': 77,
 'I-19': 78,
 'B-

In [12]:
len(labels_to_ids)

105

In [13]:
data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
# 也可以根据sentence_id去重
data.head()

Unnamed: 0,sentence,word_labels
0,手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...,"B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-..."
1,牛 皮 纸 袋 手 提 袋 定 制 l o g o 烘 焙 购 物 服 装 包 装 外 卖 ...,"B-4,I-4,I-4,I-4,B-4,I-4,I-4,B-29,I-29,I-29,I-2..."
2,彩 色 金 属 镂 空 鱼 尾 夹 长 尾 夹 [SEP] 手 帐 设 计 绘 图 文 具 ...,"B-16,I-16,B-12,I-12,B-13,I-13,B-4,I-4,I-4,B-4,..."
3,B o s e [SEP] S o u n d S p o r t [SEP] F r e ...,"B-1,I-1,I-1,I-1,O,B-3,I-3,I-3,I-3,I-3,I-3,I-3,..."
4,壁 挂 炉 专 用 水 空 调 散 热 器 带 风 扇 暖 气 片 水 暖 空 调 明 装 ...,"B-4,I-4,I-4,O,O,B-4,I-4,I-4,B-4,I-4,I-4,B-22,I..."


In [14]:
len(data)

39995

In [15]:
data.iloc[0].sentence

'手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 摄 影 拍 摄 拍 照 抖 音 看 电 视 神 器 三 角 架 便 携 伸 缩 懒 人 户 外 支 撑 架 [SEP] 【 女 神 粉 】 自 带 三 脚 架 + 蓝 牙 遥 控'

In [16]:
data.iloc[0].word_labels

'B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-4,B-7,I-7,B-4,I-4,I-4,B-11,I-11,B-11,I-11,B-4,I-4,I-4,B-5,I-5,B-5,I-5,B-5,I-5,B-13,I-13,B-4,I-4,I-4,I-4,I-4,B-4,I-4,I-4,B-11,I-11,B-11,I-11,B-8,I-8,B-7,I-7,B-4,I-4,I-4,O,O,B-16,I-16,I-16,O,O,O,B-4,I-4,I-4,O,B-11,I-11,B-11,I-11'

In [17]:
len(data['sentence'][0].split(' '))

66

In [18]:
data['sentence'].apply(lambda x:len(x.split(' '))).describe()

count    39995.000000
mean        56.220828
std         13.473300
min          7.000000
25%         46.000000
50%         56.000000
75%         65.000000
max        101.000000
Name: sentence, dtype: float64

#### **构建DataLoader**

In [19]:
MAX_LEN = 105 # 120
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 32
EPOCHS = 10
LEARNING_RATE = 2e-05
MAX_GRAD_NORM = 5
MODEL_NAME='hfl/chinese-roberta-wwm-ext'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME) # encode_plus()# 整体

BERT做NER 一个棘手部分是 BERT 依赖于 **wordpiece tokenization**，而不是 word tokenization。 

比如：Washington的标签为 "b-gpe",分词之后得到， "Wash", "##ing", "##ton","b-gpe", "b-gpe", "b-gpe"






In [20]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    """
  
    
    Word piece tokenization使得很难将词标签与单个subword进行匹配。
    这个函数每次次对每个单词进行一个分词，这样方便为每个subword保留正确的标签。 
    当然，它的处理时间有点慢，但它会帮助我们的模型达到更高的精度。
    """

    tokenized_sentence = []
    labels = []

    sentence = sentence.strip()

    for word, label in zip(sentence.split(), text_labels.split(",")):

        # 逐字分词
        tokenized_word = tokenizer.tokenize(word) # id
        n_subwords = len(tokenized_word) # 1

        # 将单个字分词结果追加到句子分词列表
        tokenized_sentence.extend(tokenized_word)

        # 标签同样添加n个subword，与原始word标签一致
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels

In [21]:
data.iloc[0]

sentence       手 机 三 脚 架 网 红 直 播 支 架 桌 面 自 拍 杆 蓝 牙 遥 控 三 脚 架 ...
word_labels    B-40,I-40,B-4,I-4,I-4,B-14,I-14,B-5,I-5,B-4,I-...
Name: 0, dtype: object

In [24]:
# tokenize_and_preserve_labels(data.iloc[0]['sentence'],data.iloc[0]['word_labels'],tokenizer)

>这里有其他的处理方式，比如只有第一个subword给定原始标签，其他subword给定一个无关标签

In [26]:
# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

# https://arxiv.org/abs/1810.04805

In [23]:
encoding_result=tokenizer.encode_plus('这里有其他的处理方式，比如只有第一个subword给定原始标签，其他subword给定一个无关标签')
encoding_result.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [28]:
encoding_result

{'input_ids': [101, 6821, 7027, 3300, 1071, 800, 4638, 1905, 4415, 3175, 2466, 8024, 3683, 1963, 1372, 3300, 5018, 671, 702, 11541, 8204, 10184, 5314, 2137, 1333, 1993, 3403, 5041, 8024, 1071, 800, 11541, 8204, 10184, 5314, 2137, 671, 702, 3187, 1068, 3403, 5041, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [29]:
tokenizer.encode_plus('1[SEP]2')

{'input_ids': [101, 122, 102, 123, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

In [30]:
# tokenizer.convert_ids_to_tokens([101, 6821, 7027, 3300, 1071, 800, 4638, 1905, 4415, 3175, 2466, 8024, 3683, 1963, 1372, 3300, 5018, 671, 702, 11541, 8204, 10184, 5314, 2137, 1333, 1993, 3403, 5041, 8024, 1071, 800, 11541, 8204, 10184, 5314, 2137, 671, 702, 3187, 1068, 3403, 5041, 102])

In [25]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        # 步骤 1: 对每个句子分词
        sentence = self.data.sentence[index]  
        word_labels = self.data.word_labels[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
        # 步骤 2: 添加特殊token并添加对应的标签
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"] # add special tokens
        labels.insert(0, "O") # 给[CLS] token添加O标签
        labels.insert(-1, "O") # 给[SEP] token添加O标签

        # 步骤 3: 截断/填充
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # 截断
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # 填充
          tokenized_sentence = tokenized_sentence + ['[PAD]'for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # 步骤 4: 构建attention mask
        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        
        # 步骤 5: 将分词结果转为词表的id表示
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

        label_ids = [labels_to_ids[label] for label in labels]
  
        
        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              #'token_type_ids': torch.tensor(token_ids, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

按照8：2比列将数据集，划分为训练集和测试集

In [26]:
from sklearn.model_selection import train_test_split
# train_dataset,test_dataset=train_test_split(data,test_size=0.2,random_state=42)

In [27]:
train_size = 0.9
train_dataset = data.sample(frac=train_size,random_state=200) # 训练训练集和验证集 0.9比例
test_dataset = data.drop(train_dataset.index).reset_index(drop=True) # 0.1
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (39995, 2)
TRAIN Dataset: (35996, 2)
TEST Dataset: (3999, 2)


下面为第一个样本的分词id与标签：

In [28]:
training_set[0]

{'ids': tensor([ 101, 5988, 1094,  102,  704, 2595, 5011, 5708,  102,  704, 2595, 5011,
         1059, 7151, 5052, 1928,  121,  119,  126,  155,  155, 7946, 5682, 5273,
         5682, 5905, 5682, 2110, 4495, 5440, 6407,  683, 4500, 4823, 5162, 5011,
         3296, 5708,  121,  119,  124,  129, 2110, 4495, 3152, 1072, 5041, 2099,
         3717,  102,  523, 6851, 3209,  121,  119,  126,  524,  122,  121,  121,
         3118, 5905, 5682, 1928,  116,  123, 3118, 5011,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0]),
 'mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0,

In [29]:
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["ids"]), training_set[0]["targets"]):
  print('{0:10}  {1}   {2}'.format(token, label,ids_to_labels[label.numpy().tolist()]))

[CLS]       16   O
虎           16   O
冠           16   O
[SEP]       16   O
中           2   B-4
性           3   I-4
笔           3   I-4
芯           3   I-4
[SEP]       16   O
中           2   B-4
性           3   I-4
笔           3   I-4
全           12   B-13
针           13   I-13
管           13   I-13
头           13   I-13
0           25   B-18
.           26   I-18
5           26   I-18
m           26   I-18
m           26   I-18
黑           17   B-16
色           18   I-16
红           17   B-16
色           18   I-16
蓝           17   B-16
色           18   I-16
学           14   B-8
生           15   I-8
考           6   B-5
试           7   I-5
专           7   I-5
用           7   I-5
碳           2   B-4
素           3   I-4
笔           3   I-4
替           3   I-4
芯           3   I-4
0           25   B-18
.           26   I-18
3           26   I-18
8           26   I-18
学           14   B-8
生           15   I-8
文           2   B-4
具           3   I-4
签           16   O
字           16   O
水    

创建Pytorch的DataLoader

In [30]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

#### **定义网络**

- 模型结构：BertForTokenClassification

- 预训练权重： "bert-base-uncased"

In [37]:
len(labels_to_ids)

105

In [38]:
model = BertForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(labels_to_ids))
model.to(device)

Some weights of the model checkpoint at hfl/chinese-roberta-wwm-ext were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at h

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

#### **训练模型**



In [39]:
# ids.shape

In [40]:
ids = training_set[0]["ids"].unsqueeze(0)
mask = training_set[0]["mask"].unsqueeze(0)
targets = training_set[0]["targets"].unsqueeze(0) # 真实标签
ids = ids.to(device)
mask = mask.to(device)
targets = targets.to(device)
outputs = model(input_ids=ids, attention_mask=mask, labels=targets) # 输出有两个：一个为loss和一个为logits
initial_loss = outputs[0]
initial_loss

tensor(4.4463, device='cuda:0', grad_fn=<NllLossBackward>)

模型输出logits大小为 (batch_size, sequence_length, num_labels):

In [41]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 105, 105])

设置优化器Adam

In [42]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [43]:
# 训练函数
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # 将model设置为train模式
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['ids'].to(device, dtype = torch.long) #(4,91)
        mask = batch['mask'].to(device, dtype = torch.long) #(4,91)
        targets = batch['targets'].to(device, dtype = torch.long)#(4,91)
        
        
        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs[0],outputs[1]
        # print(outputs.keys())
        # print(loss)
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)
        
        if idx % 500==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 500 training steps: {loss_step}")
           
        # 计算准确率
        flattened_targets = targets.view(-1) # 真实标签 大小 (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # 模型输出shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # 取出每个token对应概率最大的标签索引 shape (batch_size * seq_len,)
        # MASK：PAD
        active_accuracy = mask.view(-1) == 1 # shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_preds.extend(predictions)
        tr_labels.extend(targets)
        
        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # 梯度剪切
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )
        
        # loss反向求导
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

训练模型

In [44]:
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 500 training steps: 4.467368125915527
Training loss epoch: 1.4484657735824584
Training accuracy epoch: 0.4421112756600715
Training epoch: 2
Training loss per 500 training steps: 0.7548065185546875
Training loss epoch: 0.6540753860473633
Training accuracy epoch: 0.7134929778351335
Training epoch: 3
Training loss per 500 training steps: 0.5577887892723083
Training loss epoch: 0.5146245424747466
Training accuracy epoch: 0.7632178597980472
Training epoch: 4
Training loss per 500 training steps: 0.5142039060592651
Training loss epoch: 0.4485367271900177
Training accuracy epoch: 0.7861171425380934
Training epoch: 5
Training loss per 500 training steps: 0.3622732162475586
Training loss epoch: 0.40163199257850646
Training accuracy epoch: 0.8036167708797474
Training epoch: 6
Training loss per 500 training steps: 0.33577480912208557
Training loss epoch: 0.3655732891559601
Training accuracy epoch: 0.817925539477108
Training epoch: 7
Training loss per 500 traini

#### **评估模型**

验证集评估

In [45]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
            # loss, eval_logits = model(input_ids=ids, attention_mask=mask, labels=targets)
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs[0],outputs[1]
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # 计算准确率
            flattened_targets = targets.view(-1) # 大小 (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # 大小 (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # 大小 (batch_size * seq_len,)
            active_accuracy = mask.view(-1) == 1 # 大小 (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(targets)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy
    
    #print(eval_labels)
    #print(eval_preds)

    labels = [ids_to_labels[id.item()] for id in eval_labels]
    predictions = [ids_to_labels[id.item()] for id in eval_preds]

    #print(labels)
    #print(predictions)
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [46]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.5618929266929626
Validation loss per 100 evaluation steps: 0.48445309949393317
Validation loss per 100 evaluation steps: 0.47979988669281576
Validation loss per 100 evaluation steps: 0.47257913079768715
Validation loss per 100 evaluation steps: 0.4780060602096548
Validation loss per 100 evaluation steps: 0.4786392832825522
Validation loss per 100 evaluation steps: 0.47741880680677695
Validation loss per 100 evaluation steps: 0.4783220507108876
Validation loss per 100 evaluation steps: 0.47767649447724464
Validation loss per 100 evaluation steps: 0.47792532454584863
Validation loss per 100 evaluation steps: 0.4774629267660173
Validation loss per 100 evaluation steps: 0.4775117500384865
Validation Loss: 0.4775778596666124
Validation Accuracy: 0.7795549782141706


In [47]:
# len(predictions),len(labels)

In [48]:
tmp=[]
for tags in data['word_labels']:
    tmp.extend(tags.split(','))
pd.Series(tmp).value_counts()

O       319186
I-4     312855
B-4     167244
I-18    141111
I-38    131220
         ...  
B-53         5
B-35         3
I-35         2
B-26         1
I-26         1
Length: 105, dtype: int64

In [49]:
ids_to_labels[18]

'I-16'

In [50]:
from seqeval.metrics import classification_report

print(classification_report([labels], [predictions])) # [] 避免报错TypeError: Found input variables without list of list.

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           1       0.82      0.92      0.87     22532
          10       0.41      0.46      0.44      8269
          11       0.69      0.77      0.73     55217
          12       0.71      0.79      0.75     11104
          13       0.63      0.68      0.65     61019
          14       0.84      0.89      0.86     20291
          15       0.54      0.64      0.59       763
          16       0.76      0.86      0.81     23692
          17       0.00      0.00      0.00        30
          18       0.65      0.69      0.67     55887
          19       0.00      0.00      0.00       115
           2       0.15      0.20      0.17      2764
          20       0.12      0.08      0.10       508
          21       0.10      0.32      0.16       539
          22       0.28      0.31      0.30      9511
          23       0.00      0.00      0.00        23
          24       0.00      0.00      0.00         4
          25       0.00    

#### **预测**



In [84]:
tmp=[]
cnt=1
with open('/home/mw/input/task056960/sample_per_line_preliminary_A.txt','r',encoding='utf-8') as f:
    for line in tqdm(f.read().split('\n')):
        sentence_id=f'test_{cnt}'
        for word in line:
            # print(word,sentence_id,cnt)
            if word.strip():
                tmp.append([sentence_id,word,'O'])
            else:
                tmp.append([sentence_id,'[SEP]','O'])
                
        cnt+=1

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 12213.48it/s]


In [85]:
test_data=pd.DataFrame(tmp,columns=['sentence_id','words','tags'])
test_data.head()

Unnamed: 0,sentence_id,words,tags
0,test_1,O,O
1,test_1,P,O
2,test_1,P,O
3,test_1,O,O
4,test_1,闪,O


In [86]:
test_data['sentence'] = test_data[['sentence_id','words','tags']].groupby(['sentence_id'])['words'].transform(lambda x: ' '.join(x))
test_data['word_labels'] = test_data[['sentence_id','words','tags']].groupby(['sentence_id'])['tags'].transform(lambda x: ','.join(x))
test_data.head()

Unnamed: 0,sentence_id,words,tags,sentence,word_labels
0,test_1,O,O,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
1,test_1,P,O,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
2,test_1,P,O,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
3,test_1,O,O,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
4,test_1,闪,O,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."


In [87]:
test_data = test_data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
# 也可以根据sentence_id去重
test_data.head()

Unnamed: 0,sentence,word_labels
0,O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
1,O W I N 净 水 器 家 用 厨 房 欧 恩 科 技 反 渗 透 纯 水 机 O - ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
2,教 学 教 具 磁 条 贴 磁 性 条 背 胶 ( 3 M ) 软 磁 条 [SEP] 黑 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
3,【 限 时 促 销 】 笔 记 本 文 具 创 意 复 古 古 风 本 子 线 装 记 事 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."
4,2 0 本 装 a 5 笔 记 本 子 文 具 学 生 B 5 软 抄 本 子 记 事 本 ...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,..."


In [88]:
test_data['sentence'][0]

'O P P O 闪 充 充 电 器 [SEP] X 9 0 7 0 [SEP] X 9 0 7 7 [SEP] R 5 [SEP] 快 充 头 通 用 手 机 数 据 线 [SEP] 套 餐 【 2 . 4 充 电 头 + 数 据 线 [SEP] 】 [SEP] 安 卓 [SEP] 1 . 5 m'

In [90]:
''.join( test_data['sentence'][0].split())

'OPPO闪充充电器[SEP]X9070[SEP]X9077[SEP]R5[SEP]快充头通用手机数据线[SEP]套餐【2.4充电头+数据线[SEP]】[SEP]安卓[SEP]1.5m'

In [92]:
tokenize_and_preserve_labels(test_data.iloc[0]['sentence'],test_data.iloc[0]['word_labels'],tokenizer)

(['o',
  'p',
  'p',
  'o',
  '闪',
  '充',
  '充',
  '电',
  '器',
  '[SEP]',
  'x',
  '9',
  '0',
  '7',
  '0',
  '[SEP]',
  'x',
  '9',
  '0',
  '7',
  '7',
  '[SEP]',
  'r',
  '5',
  '[SEP]',
  '快',
  '充',
  '头',
  '通',
  '用',
  '手',
  '机',
  '数',
  '据',
  '线',
  '[SEP]',
  '套',
  '餐',
  '【',
  '2',
  '.',
  '4',
  '充',
  '电',
  '头',
  '+',
  '数',
  '据',
  '线',
  '[SEP]',
  '】',
  '[SEP]',
  '安',
  '卓',
  '[SEP]',
  '1',
  '.',
  '5',
  'm'],
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'])

In [93]:
test_set = dataset(test_data, tokenizer, MAX_LEN)

In [94]:
test_set[0]


{'ids': tensor([ 101,  157,  158,  158,  157, 7306, 1041, 1041, 4510, 1690,  102,  166,
          130,  121,  128,  121,  102,  166,  130,  121,  128,  128,  102,  160,
          126,  102, 2571, 1041, 1928, 6858, 4500, 2797, 3322, 3144, 2945, 5296,
          102, 1947, 7623,  523,  123,  119,  125, 1041, 4510, 1928,  116, 3144,
         2945, 5296,  102,  524,  102, 2128, 1294,  102,  122,  119,  126,  155,
          102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0]),
 'mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0,

In [95]:
for token, label in zip(tokenizer.convert_ids_to_tokens(test_set[0]["ids"]), test_set[0]["targets"]):
  print('{0:10}  {1}   {2}'.format(token, label,ids_to_labels[label.numpy().tolist()]))

[CLS]       16   O
o           16   O
p           16   O
p           16   O
o           16   O
闪           16   O
充           16   O
充           16   O
电           16   O
器           16   O
[SEP]       16   O
x           16   O
9           16   O
0           16   O
7           16   O
0           16   O
[SEP]       16   O
x           16   O
9           16   O
0           16   O
7           16   O
7           16   O
[SEP]       16   O
r           16   O
5           16   O
[SEP]       16   O
快           16   O
充           16   O
头           16   O
通           16   O
用           16   O
手           16   O
机           16   O
数           16   O
据           16   O
线           16   O
[SEP]       16   O
套           16   O
餐           16   O
【           16   O
2           16   O
.           16   O
4           16   O
充           16   O
电           16   O
头           16   O
+           16   O
数           16   O
据           16   O
线           16   O
[SEP]       16   O
】           16   O
[SEP]       

In [108]:
test_params = {'batch_size': 1,
                'shuffle': False,
                'num_workers': 0
                }
testing_loader = DataLoader(test_set, **test_params)

In [121]:
def predict():
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in tqdm(enumerate(testing_loader)):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
            # loss, eval_logits = model(input_ids=ids, attention_mask=mask, labels=targets)
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs[0],outputs[1]
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
#             if idx % 100==0:
#                 print(f"Validation steps: {idx}")
              
            active_logits = eval_logits.view(-1, model.num_labels) # 大小 (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # 大小 (batch_size * seq_len,)
            eval_preds.append(flattened_predictions)
            
    # predictions = [ids_to_labels[id.item()] for id in eval_preds]
    return eval_preds

In [122]:
test_preds=predict()

10000it [02:55, 57.01it/s]


In [123]:
len(test_preds)

10000

In [142]:
test_data['sentence'][0].split()

['O',
 'P',
 'P',
 'O',
 '闪',
 '充',
 '充',
 '电',
 '器',
 '[SEP]',
 'X',
 '9',
 '0',
 '7',
 '0',
 '[SEP]',
 'X',
 '9',
 '0',
 '7',
 '7',
 '[SEP]',
 'R',
 '5',
 '[SEP]',
 '快',
 '充',
 '头',
 '通',
 '用',
 '手',
 '机',
 '数',
 '据',
 '线',
 '[SEP]',
 '套',
 '餐',
 '【',
 '2',
 '.',
 '4',
 '充',
 '电',
 '头',
 '+',
 '数',
 '据',
 '线',
 '[SEP]',
 '】',
 '[SEP]',
 '安',
 '卓',
 '[SEP]',
 '1',
 '.',
 '5',
 'm']

In [141]:
[ids_to_labels[id.item()] for id in test_preds[0]][1:60]

['B-37',
 'I-37',
 'I-37',
 'I-37',
 'B-11',
 'I-11',
 'B-4',
 'I-4',
 'I-4',
 'O',
 'B-38',
 'I-38',
 'I-38',
 'I-38',
 'I-38',
 'O',
 'B-38',
 'I-38',
 'I-38',
 'I-38',
 'I-38',
 'O',
 'B-38',
 'I-38',
 'O',
 'B-4',
 'I-4',
 'I-4',
 'B-11',
 'I-11',
 'B-40',
 'I-40',
 'B-4',
 'I-4',
 'I-4',
 'O',
 'O',
 'O',
 'O',
 'B-18',
 'I-18',
 'I-18',
 'B-4',
 'I-4',
 'I-4',
 'O',
 'B-4',
 'I-4',
 'I-4',
 'O',
 'O',
 'O',
 'B-37',
 'I-37',
 'O',
 'B-18',
 'I-18',
 'I-18',
 'O']

In [151]:
# test_sents[0]

In [150]:
# y_pred[0][1:len(test_sents[0])+1]

In [156]:
y_preds=[]
for pred in test_preds:
    y_preds.append([ids_to_labels[id.item()] for id in pred])

In [143]:
test_file='data/preliminary_test_a/sample_per_line_preliminary_A.txt'
test_sents=[]
with open(test_file, 'r', encoding='utf-8') as f:
    for line in f.read().split('\n'):
        test_sents.append(line)

In [157]:
list_results=[]
for sent, ner_tag in zip(test_sents, y_preds):
    line_result=[]
    for word, tag in zip(sent, ner_tag[1:len(test_sents)+1]):
        line_result.append((word,tag))
    list_results.append(line_result)    

In [158]:
with open('crf.txt','w',encoding='utf-8') as f:
    for i,line_result in enumerate(list_results):
        for word,tag in line_result:
            f.write(f'{word} {tag}\n')
        if i<len(list_results)-1:
            f.write('\n')

In [91]:
# sentence ='OPPO闪充充电器[SEP]X9070[SEP]X9077[SEP]R5[SEP]快充头通用手机数据线[SEP]套餐【2.4充电头+数据线[SEP]】[SEP]安卓[SEP]1.5m'



# inputs = tokenizer(sentence, padding='max_length', truncation=True, max_length=MAX_LEN, return_tensors="pt")

# # 加载到gpu
# ids = inputs["input_ids"].to(device)
# mask = inputs["attention_mask"].to(device)
# # 输入到模型
# outputs = model(ids, mask)
# logits = outputs[0]

# active_logits = logits.view(-1, model.num_labels) # 大小 (batch_size * seq_len, num_labels)
# flattened_predictions = torch.argmax(active_logits, axis=1) # 大小 (batch_size*seq_len,) 

# tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
# token_predictions = [ids_to_labels[i] for i in flattened_predictions.cpu().numpy()]
# wp_preds = list(zip(tokens, token_predictions)) # tuple = (wordpiece, prediction)

# word_level_predictions = []
# for pair in wp_preds:
#   if (pair[0].startswith(" ##")) or (pair[0] in ['[CLS]', '[SEP]', '[PAD]']):
#     # skip prediction
#     continue
#   else:
#     word_level_predictions.append(pair[1])

# # 拼接文本
# str_rep = " ".join([t[0] for t in wp_preds if t[0] not in ['[CLS]', '[SEP]', '[PAD]']]).replace(" ##", "")
# print(str_rep)
# print(word_level_predictions)

oppo 闪 充 充 电 器 x9070 x9077 r5 快 充 头 通 用 手 机 数 据 线 套 餐 【 2 . 4 充 电 头 + 数 据 线 】 安 卓 1 . 5m
['B-37', 'B-11', 'I-4', 'B-4', 'I-4', 'I-4', 'B-38', 'I-38', 'I-38', 'B-38', 'I-38', 'I-38', 'B-38', 'I-38', 'B-4', 'I-4', 'I-4', 'B-11', 'I-11', 'B-40', 'I-40', 'B-4', 'I-4', 'I-4', 'O', 'O', 'O', 'B-18', 'I-18', 'I-18', 'B-4', 'I-4', 'I-4', 'O', 'B-4', 'I-4', 'I-4', 'O', 'B-37', 'I-37', 'B-18', 'I-18', 'O']


#### **保存模型**

保存模型词汇表 、模型权重、配置文件，之后可以用 `from_pretrained()` 



In [57]:
import os

directory = "./model"

if not os.path.exists(directory):
    os.makedirs(directory)

# 保存tokenizer
tokenizer.save_vocabulary(directory)
# 保存权重和配置文件
model.save_pretrained(directory)
print('All files saved')
print('This tutorial is completed')

All files saved
This tutorial is completed
