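Prompt-based classification is an emerging text classification technique: a task-specific prompt text is fed into a pre-trained language model together with the input text, and the model's output is read as the class. The approach is flexible and extensible, and it has achieved strong results on a range of NLP tasks.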

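The basic idea is to recast text classification as a masked language modeling (MLM) task and to decide the class from the model's prediction at the masked position ([MASK]). Take the task of judging from a text description whether the weather is good or bad, with classes {good, bad}. The conventional approach adds a classification layer on top of BERT and assigns the input to the class whose output node has the highest probability. Prompt-based classification instead wraps the input text in prompt text and places a mask token where the class should appear: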

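- Input: [CLS] Description: It is sunny today with a gentle breeze. Weather: [MASK] [SEP]
- Output: Weather: good

The advantage of prompt-based classification is that it draws on the expressive power and generalization ability of the pre-trained language model, with no extra parameters and, in many cases, no fine-tuning. The challenge is designing prompt text that steers the model toward the correct inference and prediction.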
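
As a minimal sketch of the weather example above, the snippet below builds the prompted input and maps the prediction at [MASK] to a class through a small label-word table (often called a verbalizer). The names `build_prompt`, `VERBALIZER`, and `classify_from_mask_scores` are hypothetical, and the scores passed in are made-up stand-ins for the probabilities a masked language model would assign at the [MASK] position; no model is loaded here.

```python
# Hypothetical verbalizer: each class is represented by one label word
# that the model could plausibly predict at the [MASK] position.
VERBALIZER = {"good": "good", "bad": "bad"}


def build_prompt(text, template="Description: {text} Weather: [MASK]"):
    """Wrap the raw input text in the prompt template."""
    return template.format(text=text)


def classify_from_mask_scores(mask_scores, verbalizer=VERBALIZER):
    """Pick the class whose label word scores highest at [MASK].

    mask_scores maps candidate tokens to the probability assigned at the
    masked position; tokens outside the verbalizer are ignored.
    """
    return max(verbalizer, key=lambda label: mask_scores.get(verbalizer[label], 0.0))


prompt = build_prompt("It is sunny today with a gentle breeze.")
# Illustrative scores only; a real MLM would produce these.
example_scores = {"good": 0.71, "bad": 0.05, "nice": 0.12}
predicted = classify_from_mask_scores(example_scores)
print(prompt)
print(predicted)
```

Restricting the comparison to the label words, rather than taking the top token over the whole vocabulary, keeps the prediction inside the class set even when the model prefers an unrelated word such as "nice".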

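- Step 1: Load a BERT model or a T5 model
- Step 2: Wrap each sample in a custom prompt
- Step 3: Train and predict by classifying at the [MASK] position
- Step 4: After completing the steps above, answer the following questions
    - How does the accuracy of prompt-based classification compare with that of BERT classification?
    - Does the custom prompt affect the model's accuracy? Try two different prompts.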
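
For the second question in step 4, one way to compare prompts is to hold the data and the predictor fixed and vary only the template. The sketch below assumes a hypothetical `predict` callable that takes a prompted string and returns a label; here it is replaced by a trivial keyword stub so the comparison loop can run without a model. The templates, the label subset, and the sample sentences are illustrative and not taken from the dataset.

```python
LABELS = ["Travel-Query", "Music-Play", "Weather-Query", "Other"]  # illustrative subset

# Two hypothetical prompt templates to compare.
TEMPLATES = {
    "question": "Intent recognition task: {text} Which category is this? {labels}",
    "cloze": "{text} The intent of this sentence is one of {labels}:",
}


def build_prompt(text, template, labels=LABELS):
    """Step 2: wrap one sample in a custom prompt."""
    return template.format(text=text, labels="/".join(labels))


def accuracy(predictions, golds):
    """Fraction of predictions that equal the gold label."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def evaluate(predict, samples, template):
    """Run a predictor over (text, label) pairs using one template."""
    preds = [predict(build_prompt(text, template)) for text, _ in samples]
    return accuracy(preds, [label for _, label in samples])


def stub_predict(prompt):
    """Stand-in for a real model call; keys on a single keyword."""
    return "Music-Play" if "song" in prompt else "Other"


samples = [
    ("play a song by Jay Chou", "Music-Play"),
    ("is there a bus ticket on the 13th", "Travel-Query"),
]
results = {name: evaluate(stub_predict, samples, tpl) for name, tpl in TEMPLATES.items()}
print(results)
```

With a real model in place of `stub_predict`, the two entries in `results` would give a direct per-template accuracy comparison on the same samples.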

In [2]:
import torch
from transformers import T5Tokenizer
from transformers import T5Config
from transformers import T5ForConditionalGeneration

In [7]:
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

pretrained_model = "IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese"

special_tokens = ["<extra_id_{}>".format(i) for i in range(100)]
tokenizer = T5Tokenizer.from_pretrained(pretrained_model,
                                       do_lower_case=True,
                                       max_length=512,
                                       truncation=True,
                                       additional_special_tokens=special_tokens)

config = T5Config.from_pretrained(pretrained_model)
model = T5ForConditionalGeneration.from_pretrained(pretrained_model, config=config)
model.resize_token_embeddings(len(tokenizer))
model.eval()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# tokenize
text = "意图识别任务：还有双鸭山到淮阴的汽车票吗13号的 这篇文章的类别是什么？Travel-Query/Music-Play/FilmTele-Play/Video-Play/Radio-Listen/HomeAppliance-Control/Weather-Query/Alarm-Update/Calendar-Query/TVProgram-Play/Audio-Play/Other"
encode_dict = tokenizer(text, max_length=512, padding='max_length',truncation=True)

inputs = {
  "input_ids": torch.tensor([encode_dict['input_ids']]).long().to(device),
  "attention_mask": torch.tensor([encode_dict['attention_mask']]).long().to(device),
  }

# generate answer
logits = model.generate(
  input_ids=inputs['input_ids'],
  attention_mask=inputs['attention_mask'],  # ignore the padding added by padding='max_length'
  max_length=100,
  do_sample=True,
  # early_stopping=True,
  )

logits=logits[:,1:]
predict_label = [tokenizer.decode(i,skip_special_tokens=True) for i in logits]
print(predict_label)

OSError: Can't load tokenizer for 'IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'IDEA-CCNL/Randeng-T5-784M-MultiTask-Chinese' is the correct path to a directory containing all relevant files for a T5Tokenizer tokenizer.

In [8]:
# Load the dataset. It is read directly over the network here; the files can also be downloaded first and read locally.
import pandas as pd
import matplotlib.pyplot as plt

data_dir = 'https://mirror.coggle.club/dataset/coggle-competition/'
train_data = pd.read_csv(data_dir + 'intent-classify/train.csv', sep='\t', header=None)
test_data = pd.read_csv(data_dir + 'intent-classify/test.csv', sep='\t', header=None)

In [9]:
train_data

Unnamed: 0,0,1
0,还有双鸭山到淮阴的汽车票吗13号的,Travel-Query
1,从这里怎么回家,Travel-Query
2,随便播放一首专辑阁楼里的佛里的歌,Music-Play
3,给看一下墓王之王嘛,FilmTele-Play
4,我想看挑战两把s686打突变团竞的游戏视频,Video-Play
...,...,...
12095,一千六百五十三加三千一百六十五点六五等于几,Calendar-Query
12096,稍小点客厅空调风速,HomeAppliance-Control
12097,黎耀祥陈豪邓萃雯畲诗曼陈法拉敖嘉年杨怡马浚伟等到场出席,Radio-Listen
12098,百事盖世群星星光演唱会有谁,Video-Play


In [14]:
# Format the labels as T5 expects: one '/'-separated string
label_text = '/'.join(train_data[1].unique())

In [12]:
%%time

text = "意图识别任务：【播放周杰伦的歌曲】 这篇文章的类别是什么？Travel-Query/Music-Play/FilmTele-Play/Video-Play/Radio-Listen/HomeAppliance-Control/Weather-Query/Alarm-Update/Calendar-Query/TVProgram-Play/Audio-Play/Other"
encode_dict = tokenizer(text)

inputs = {
  "input_ids": torch.tensor([encode_dict['input_ids']]).long().to(device),
  "attention_mask": torch.tensor([encode_dict['attention_mask']]).long().to(device),
}

logits = model.generate(
    input_ids = inputs['input_ids'],
    # attention_mask = inputs['attention_mask'],
    max_length=20, 
    do_sample= False
)

logits=logits[:,1:]
predict_label = [tokenizer.decode(i,skip_special_tokens=True) for i in logits]
predict_label

NameError: name 'tokenizer' is not defined

In [13]:
from tqdm.notebook import tqdm  # tqdm.tqdm_notebook is deprecated

In [15]:
pred_label = []

for train_text in tqdm(train_data[0].values):
    text = f"意图识别任务: {train_text} 这篇文章的类别是什么? {label_text}"
    encode_dict = tokenizer(text, max_length=512, padding='max_length', truncation=True)
    
    inputs = {
        "input_ids": torch.tensor([encode_dict['input_ids']]).long().to(device),
        "attention_mask": torch.tensor([encode_dict['attention_mask']]).long().to(device),
    }
    
    # Generate the answer
    logits = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=100,
        do_sample=False,  # greedy decoding, so predictions are reproducible
    )
    
    # Drop the leading decoder start token, as in the cells above
    logits = logits[:, 1:]
    pred_label += [tokenizer.decode(i, skip_special_tokens=True) for i in logits]


  0%|          | 0/12100 [00:00<?, ?it/s]

NameError: name 'tokenizer' is not defined
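
Because the T5 model generates free text, a decoded prediction may differ slightly from the label strings (extra whitespace, a truncated suffix). One possible post-processing step, sketched below with the standard-library `difflib`, snaps each prediction to the closest known label and then computes accuracy against the gold labels, which is what the first question in step 4 needs. The helper name `normalize_label`, the 0.6 cutoff, the fallback to `Other`, and the sample predictions are assumptions, not part of the original notebook.

```python
import difflib

# Label set as listed in the prompt used by the cells above.
LABELS = [
    "Travel-Query", "Music-Play", "FilmTele-Play", "Video-Play",
    "Radio-Listen", "HomeAppliance-Control", "Weather-Query", "Alarm-Update",
    "Calendar-Query", "TVProgram-Play", "Audio-Play", "Other",
]


def normalize_label(prediction, labels=LABELS, fallback="Other", cutoff=0.6):
    """Map a decoded string to a known label, or to the fallback."""
    cleaned = prediction.strip()
    if cleaned in labels:
        return cleaned
    # Closest label by difflib similarity; empty list if nothing reaches the cutoff.
    matches = difflib.get_close_matches(cleaned, labels, n=1, cutoff=cutoff)
    return matches[0] if matches else fallback


# Made-up decoder outputs and gold labels, for illustration only.
raw_predictions = [" Music-Play", "Travel-Quer", "???"]
gold_labels = ["Music-Play", "Travel-Query", "Weather-Query"]

normalized = [normalize_label(p) for p in raw_predictions]
acc = sum(p == g for p, g in zip(normalized, gold_labels)) / len(gold_labels)
print(normalized, acc)
```

In the notebook, `raw_predictions` would be `pred_label` and `gold_labels` would be `train_data[1]`, assuming the loop above was run over the training set.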

In [None]:
pd.DataFrame({
    'ID': range(1, len(pred_label) + 1),
    'Target': pred_label,
}).to_csv('nlp_submit.csv', index=False)  # do not write the DataFrame index as a column