<a href="https://colab.research.google.com/github/njucs/notebook/blob/master/HuggingfaceTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

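### **Introduction** ###
Pretrained Transformer models keep appearing, and although most of them have open-source code, their implementations differ, which makes comparing models tedious. Huggingface Transformers tracks popular new models and provides a uniform code style for using BERT, XLNet, GPT, and many others. It also has a model hub from which common pretrained models, and models fine-tuned on various tasks, can be downloaded easily. At the time of writing, transformers provided 32 pretrained language models covering more than 100 languages; it is simple, powerful, and performant.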

### **Installation** ###

In [None]:
!pip install transformers

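### **Main Concepts** ###
1. There are only three main kinds of classes: configuration, model, and tokenizer.
   - Model classes such as BertModel, covering 30+ PyTorch models (torch.nn.Module) and their TensorFlow counterparts (tf.keras.Model). The model list is at https://huggingface.co/models
   - Configuration classes such as BertConfig, which store a model's (hyper)parameters. We usually do not need to build one ourselves: if the model is not being modified, the matching configuration is used automatically when the model is created.
   - Tokenizer classes such as BertTokenizer, which store the vocabulary and convert strings into ID sequences.
   - Objects of all three kinds can be constructed from a name or a directory with from_pretrained() and saved with save_pretrained().
2. Every model is loaded through the same from_pretrained() function; transformers takes care of downloading, caching, and all other loading details.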

## **Model Inputs**
Although Transformer-based models differ from one another, their inputs can be abstracted into a uniform format.

In [None]:
# Input IDs
# Tokenizer implementations differ a lot, but they serve the same purpose: turning a sentence into a sequence of tokens, each with its own integer ID
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"

# Turn the sentence into a token sequence
tokenized_sequence = tokenizer.tokenize(sequence)
print(tokenized_sequence)

# Turn the sentence into an ID sequence
inputs = tokenizer(sequence)
print(inputs)
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)

# The ID sequence has two more elements than the token sequence because the tokenizer automatically adds special tokens such as [CLS] and [SEP]
# Use decode to turn the IDs back into text
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)

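### **About attention_mask**
When the sequences in a batch are padded to a common length, the returned attention mask tells the model which positions are padding and should therefore be masked out.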
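
To make the idea concrete, here is a minimal pure-Python sketch of padding a batch and building the matching mask. `pad_batch` is a hypothetical helper and the IDs are made up; this is not how the Transformers tokenizer is implemented, only an illustration of the output shape it produces.

```python
# Illustrative only: pad_batch is a hypothetical helper, not a Transformers API.
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every ID sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up IDs standing in for a short and a long sentence
batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The mask has the same shape as `input_ids`, so the model never has to guess whether a 0 is a real token ID or padding.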

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

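# The first ID sequence is padded with zeros at the end. This raises a problem: the model cannot tell which positions are padding.
# We could agree that 0 always means padding, but that is awkward in practice, so marking padding explicitly with an attention_mask is more convenient.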
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
print(padded_sequences["input_ids"])
print(padded_sequences["attention_mask"])

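### **About token_type_ids**
When the input consists of two sentences, the model must be told explicitly which sentence each token belongs to. token_type_ids holds the sentence ID of each token, either 0 or 1 (0 means the token belongs to the first sentence, 1 means the second). **Only two sentences can be passed as arguments at a time (to be confirmed).**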
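
The layout can be sketched in plain Python. `build_pair_inputs` is a hypothetical helper that follows the BERT convention (`[CLS] A [SEP] B [SEP]`, with segment 0 covering the first sentence and its `[SEP]`); it works on already-split tokens and is only meant to show where the 0s and 1s come from.

```python
# Illustrative only: build_pair_inputs is a hypothetical helper, not a Transformers API.
def build_pair_inputs(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark which segment each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and the first [SEP] are segment 0; sentence B and the last [SEP] are segment 1
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(
    ["where", "is", "it", "?"],
    ["i", "don't", "know"],
)
for token, type_id in zip(tokens, token_type_ids):
    print(type_id, token)
```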

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
sequence_c = "I don't know !"

# [SEP] is added for us automatically
encoded_dict = tokenizer(sequence_b, sequence_c)
decoded = tokenizer.decode(encoded_dict["input_ids"])

print(encoded_dict)
print(decoded)

## **Customizing a Model (Tuning Hyperparameters, Not Defining a New Architecture)**

In [None]:
# We need to build a configuration object
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification

# If core hyperparameters are changed, the pretrained weights can no longer be loaded with from_pretrained; the model must be trained from scratch
# The tokenizer can usually still be reused

# Case 1: change core hyperparameters, then build the tokenizer and the model
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512) # modified hyperparameters
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased') # the tokenizer can still be reused
model = DistilBertForSequenceClassification(config) # the model cannot be loaded with from_pretrained and must be trained from scratch

# Case 2: change only the last layer, for example turning a 2-class model into a 10-class model
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10) # num_labels changes the size of the final classification layer
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

'''
class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.distilbert = DistilBertModel(config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, config.num_labels)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

        self.init_weights()
'''

## **Usage**

In [None]:
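# Using pipeline

'''
The simplest way to use a pretrained model is the pipeline function, which supports the following tasks:
- Sentiment analysis: whether a text expresses positive or negative sentiment
- Text generation: given a prompt, the model continues it
- Named entity recognition: identify named entities such as people and places in the text
- Question answering: given a text and a question about it, extract the answer from the text
- Filling masked text: mask parts of a text and let the model fill in the blanks
- Summarization: produce a short summary of a long text
- Translation: translate text from one language into another
- Feature extraction: represent a text as a vector
'''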

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

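# Besides specifying the model argument by name, we can pass the path of a directory containing a model, or a model object.
# If we pass a model object, we must also pass the tokenizer.
# We need two classes. One is AutoTokenizer, which downloads and loads the tokenizer that matches the model.
# The other is AutoModelForSequenceClassification.
# These two AutoXXX classes pick the right tokenizer and model class for the checkpoint being loaded; if we already know them, we can call from_pretrained on the concrete model and tokenizer classes directly.
# Note: model classes are task-specific. Sentiment classification is a classification task, so we use AutoModelForSequenceClassification.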

# classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

results = classifier(["We are very happy to show you the 🤗 Transformers library.",
           "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

In [None]:
# About the tokenizer and the model
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Roughly speaking, the tokenizer splits text into tokens and maps them to integer IDs, so that a piece of text becomes an ID sequence.
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(inputs)

# A batch can be passed as well
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
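# pt_batch is still a dict; input_ids holds the ID sequences of the whole batch.
# The second string is shorter, so it is padded to the same length as the first.
# If a sentence is longer than max_length, the excess is cut off.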
for key, value in pt_batch.items():
  print(f"{key}: {value.numpy().tolist()}")

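# The tokenizer output can be passed straight to the model; for PyTorch, unpack the dict with **
# Model outputs can be indexed like a tuple and the first element here is the logits; apply softmax yourself if you need probabilities
pt_outputs = pt_model(**pt_batch)

# If we have the labels for the inputs, we can pass them in too, and the model computes the loss in addition to the logits
# pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))

# All hidden states and attentions can also be returned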
# pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
# all_hidden_states, all_attentions = pt_outputs[-2:]

pt_predictions = F.softmax(pt_outputs[0], dim=-1)
print(pt_predictions)

In [None]:
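# Saving and loading
from transformers import AutoModel, AutoTokenizer, TFAutoModel

save_directory = "./saved_model"  # example path; choose your own
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Switching between PyTorch and TensorFlow is also easy (TensorFlow must be installed)
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)

# To load a TensorFlow checkpoint in PyTorch, set from_tf=True.
# This only works if save_directory contains TensorFlow weights, for example saved from a TF model.
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)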

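## **Common Tasks**
- All tasks here use automatically constructed models (Auto Models), which restore the weights from a checkpoint and build the network automatically
- To get good results, we need a checkpoint suited to the task. Such checkpoints are usually pretrained on large amounts of unlabeled data and then fine-tuned on a specific task
- Not every task has a fine-tuned model available
- The fine-tuning dataset may not match our actual task exactly, so we may need to fine-tune ourselves
- For prediction, Transformers offers two approaches: pipeline, or building the model ourselves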

### **Classification**

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Sentiment classification with pipeline
nlp = pipeline("sentiment-analysis")
result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

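# Build the model ourselves to decide whether two sentences are paraphrases
'''
1. Build a tokenizer and a model from a checkpoint
2. Given two input sentences, use the tokenizer's __call__ method to build the input correctly, including token type IDs and the attention mask
3. Pass the input to the model to get the logits
4. Apply softmax to turn them into probabilities
'''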
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
  print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should not be paraphrase
for i in range(len(classes)):
  print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
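
The last step of the recipe above, turning logits into probabilities, can be sketched without any framework. The logits below are made-up numbers, not real model output, and `softmax` is a small helper written for this illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
logits = [-0.4, 2.1]  # made-up values for illustration
probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{name}: {int(round(p * 100))}%")
```

Softmax keeps the ordering of the logits, so the predicted class is the same whether the argmax is taken before or after it.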

### **Extractive Question Answering**

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Using pipeline
nlp = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""
result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

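# Building the model ourselves
'''
1. Build the tokenizer and the model
2. Define a text and some questions
3. Build the input for each question; the tokenizer inserts the right special tokens and the attention mask
4. Run the model to get logits for the start and end positions
5. Apply softmax (or take the argmax of the logits directly, which selects the same position) and pick the most likely start and end
6. Slice out the answer text using start and end
'''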

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score

    # With the start and end token indices, first use tokenizer.convert_ids_to_tokens to turn the IDs into tokens,
    # then use convert_tokens_to_string to turn the tokens into a string.
    # The pipeline above does all of this for us.
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}")
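
The span selection in the loop above can be illustrated on a toy example. The tokens and scores here are made up, and `extract_span` is a hypothetical helper that mirrors the independent argmax used in the cell; it is not the pipeline's actual decoding logic.

```python
# Illustrative only: extract_span is a hypothetical helper with made-up scores.
def extract_span(tokens, start_scores, end_scores):
    """Pick the highest-scoring start and end positions and slice the tokens."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # +1 because Python slices exclude the end index
    end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1
    return tokens[start:end]


tokens = ["[CLS]", "where", "?", "[SEP]", "based", "in", "new", "york", "[SEP]"]
start_scores = [0.1, 0.2, 0.0, 0.0, 0.5, 1.0, 5.0, 1.5, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.0, 0.2, 0.3, 1.2, 4.8, 0.1]

answer_tokens = extract_span(tokens, start_scores, end_scores)
print(" ".join(answer_tokens))
```

Because start and end are chosen independently, the end can land before the start, which yields an empty span; a more careful decoder would search over valid (start, end) pairs.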

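### **Text Generation**
We could generate text ourselves by repeatedly sampling the next token from a language model, one token at a time, but Transformers already implements this logic for us.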
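
As a rough sketch of that token-by-token loop, the example below uses a toy lookup table in place of a real language model. `TOY_MODEL` and `greedy_generate` are hypothetical names with made-up scores; the loop shows greedy decoding (always taking the highest-scoring token), which is only one of the strategies a real generation routine supports.

```python
# Toy "language model": maps the last token to scores for possible next tokens.
# The table and its scores are invented purely for illustration.
TOY_MODEL = {
    "<s>": {"I": 2.0, "we": 1.0},
    "I": {"will": 1.5, "am": 1.0},
    "will": {"go": 2.0, "stay": 0.5},
    "go": {"</s>": 3.0},
}


def greedy_generate(prompt, max_length):
    """Extend the prompt one token at a time, always taking the best-scoring token."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        scores = TOY_MODEL.get(tokens[-1])
        if not scores:  # the toy model knows no continuation
            break
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":  # end-of-sequence token stops generation
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(["<s>"], max_length=10))
```

Replacing the `max` with a random draw weighted by the scores would turn this into sampling, where repeated runs can give different continuations.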

In [None]:
from transformers import pipeline

text_generator = pipeline("text-generation")

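# Provide a context string; max_length=50 caps the sequence at 50 tokens (prompt included)
# do_sample=False picks the most probable token instead of sampling, so every run gives the same result
# The gpt2 model is used by default
# GPT-2, OpenAI GPT, CTRL, XLNet, Transfo-XL, and Reformer can all be used for text generation
# XLNet usually needs some padding text to work well, whereas GPT-2 does not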
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
print(text_generator("How to success?", max_length=100, do_sample=False))

### **Named Entity Recognition**
Named entity recognition is treated as a sequence labeling task.
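
In a sequence labeling setup, every token gets a tag and entities are read off from runs of tags. The sketch below assumes a BIO tagging scheme (`B-` starts an entity, `I-` continues it, `O` is outside); the tags are hand-written, `group_entities` is a hypothetical helper, and the label set of a particular checkpoint may differ.

```python
# Illustrative only: assumes BIO tags; group_entities is a hypothetical helper.
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (text, type) entities."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the current entity
        else:  # "O": outside any entity, so close the open one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["Hugging", "Face", "is", "based", "in", "New", "York", "City"]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(group_entities(tokens, tags))
```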

In [None]:
from transformers import pipeline

# pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))

### **Summarization**

In [None]:
from transformers import pipeline
from transformers import AutoModelWithLMHead, AutoTokenizer

# pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

# Building the model ourselves
from transformers import AutoModelForSeq2SeqLM  # AutoModelWithLMHead is deprecated; T5 is a sequence-to-sequence model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512, so we truncate the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
# generate returns token IDs; decode them back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### **Translation**

In [None]:
# for using Helsinki-NLP/opus-mt-en-zh
# have to restart the runtime
!pip install sentencepiece

In [None]:
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for encoder-decoder models
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

# pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

# Building the model ourselves
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)
# generate returns token IDs; decode them back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# English-to-Chinese translation
model_cn = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer_cn = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
translation = pipeline("translation_en_to_zh", model=model_cn, tokenizer=tokenizer_cn)

text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)

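### **Language Modeling**
1. Language modeling is usually used to pretrain a base model; the language model can then also be fine-tuned on unlabeled in-domain data
2. For example, if our task is text classification, we can fine-tune the base BERT model directly on our classification data. Alternatively, we can first continue pretraining the base BERT with the language modeling objective on unlabeled in-domain data, and then fine-tune on the labeled data for classification.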

In [None]:
# Case 1: masked language modeling (MaskedLM)
# pipeline
from transformers import pipeline
from transformers import AutoModelForMaskedLM, AutoTokenizer  # replaces the deprecated AutoModelWithLMHead
import torch
from pprint import pprint

nlp = pipeline("fill-mask")
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

# Building the tokenizer and model ourselves
'''
1. Build the tokenizer and the model, for example by loading pretrained DistilBERT from a checkpoint
2. Build the input sequence, replacing the word to be masked with tokenizer.mask_token
3. Use the tokenizer to turn the input into a list of IDs
4. Get the prediction at the masked position; its size equals the vocabulary size and it scores every candidate token
5. Take the top-k highest-scoring tokens
'''
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

# named input_ids rather than input to avoid shadowing the Python builtin
input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

token_logits = model(input_ids).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
  print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

In [None]:
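# Case 2: predicting the next token
# Sample the next token from the predicted distribution, then repeat the process to generate more text

from transformers import AutoModelForCausalLM, AutoTokenizer  # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"

input_ids = tokenizer.encode(sequence, return_tensors="pt")

# logits for the token that follows the last input position
next_token_logits = model(input_ids).logits[:, -1, :]

# filter
# Top-k filtering sets every logit outside the k largest to -inf, so those entries get probability 0 after softmax.
# Older transformers releases exported a top_k_top_p_filtering helper for this; here it is done directly with torch.topk.
# Top-p (nucleus) filtering would instead keep the smallest set of tokens whose cumulative probability reaches top_p.
top_k = 50
kth_largest = torch.topk(next_token_logits, top_k).values[..., -1, None]
filtered_next_token_logits = next_token_logits.masked_fill(next_token_logits < kth_largest, float("-inf"))

# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)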
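
To show what top-k and top-p filtering do to a set of logits, here is a small pure-Python sketch. `top_k_top_p_filter` is a hypothetical helper written for this illustration with made-up logits; it is not the library's implementation and ignores batching and edge cases such as ties.

```python
import math


# Illustrative only: a hypothetical pure-Python filter, not the Transformers helper.
def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return a copy of logits with filtered-out entries set to -inf."""
    filtered = list(logits)
    # indices sorted from highest to lowest logit
    order = sorted(range(len(filtered)), key=lambda i: filtered[i], reverse=True)
    if top_k > 0:
        for i in order[top_k:]:  # everything beyond the k best is removed
            filtered[i] = -math.inf
    if top_p < 1.0:
        m = filtered[order[0]]
        exps = [math.exp(x - m) for x in filtered]  # exp(-inf) is 0.0
        total = sum(exps)
        cumulative = 0.0
        for rank, i in enumerate(order):
            # drop a token once the tokens ranked before it already cover top_p
            if rank > 0 and cumulative >= top_p:
                filtered[i] = -math.inf
            cumulative += exps[i] / total
    return filtered


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values
print(top_k_top_p_filter(logits, top_k=2))
print(top_k_top_p_filter(logits, top_p=0.7))
```

With these logits the probabilities are roughly 0.61, 0.22, 0.14 and 0.03, so `top_p=0.7` keeps the first two tokens: the first alone covers less than 0.7, and the first two together exceed it.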

## **Huggingface Transformers Usage Summary**

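### **Preprocessing Data**
1. The main tool for processing data in Transformers is the tokenizer
2. We can use the specific tokenizer that matches a model, or let the AutoTokenizer class pick the right one automatically
3. A tokenizer splits the input text into tokens and maps the tokens to integer IDs; it also adds special tokens needed by particular tasks
4. To use a pretrained model, we must use its own tokenizer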
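
The three jobs listed above (split, map to IDs, add special tokens) can be mimicked with a toy vocabulary. `TOY_VOCAB` and `toy_encode` are hypothetical; real tokenizers use subword algorithms such as WordPiece rather than whitespace splitting, so this is only a sketch of the data flow.

```python
# Illustrative only: a toy whitespace tokenizer with an invented vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}


def toy_encode(text, add_special_tokens=True):
    """Split on whitespace, optionally wrap in [CLS]/[SEP], and look up IDs."""
    tokens = text.lower().split()
    if add_special_tokens:
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # words missing from the vocabulary fall back to the [UNK] ID
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("Hello brave world")
print(tokens)
print(ids)
```

The example also hints at why a pretrained model needs its own tokenizer: the IDs only mean something relative to the vocabulary the model was trained with.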

In [None]:
from transformers import AutoTokenizer

# Using AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Usage 1: __call__
encoded_input = tokenizer("Hello, I'm a single sentence!")
# Returns a dict containing input_ids, token_type_ids, and attention_mask
print(encoded_input)
# The decode method turns the IDs back into a string
# Special tokens such as [CLS] and [SEP] have been added
# Not every model needs special tokens; pass add_special_tokens=False to disable this behavior
print(tokenizer.decode(encoded_input["input_ids"]))

# Usage 2: processing a batch
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
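# For most applications, a batch needs padding or truncation so that every sentence in it has the same length, and the result is returned as tensors
# If max_length is not given, truncation falls back to the model's maximum input length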
batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)

# Usage 3: processing two inputs
# Pass two arguments to __call__ (two separate arguments, not one list)
encoded_input = tokenizer("How old are you?", "I'm 6 years old")
print(encoded_input)
print(tokenizer.decode(encoded_input["input_ids"]))

# Usage 4: pass two lists to process sentence pairs as a batch
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
                             "And I should be encoded with the second sentence",
                             "And I go with the very last one"]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
print(encoded_inputs)
for ids in encoded_inputs["input_ids"]:
  print(tokenizer.decode(ids))
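# Padding, truncation, and returning tensors work here as well
# This ensures every concatenated sentence pair in the batch ends up the same length, rather than padding each sentence of a pair separately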
batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
print(batch)
for ids in batch["input_ids"]:
  print(tokenizer.decode(ids))

# Usage 5: pre-tokenized input
# Pre-tokenized means the text has already been split into words, but subword processing has not been applied yet
# For pre-tokenized input, pass is_split_into_words=True
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
print(encoded_input)
# To process a batch, pass a list of string lists
batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
                   ["And", "another", "sentence"],
                   ["And", "the", "very", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
print(encoded_inputs)
# If every input is a sentence pair, pass two such lists of string lists
batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
print(encoded_inputs)
# Padding, truncation, and returning tensors work here too:
batch = tokenizer(batch_sentences,
                  batch_of_second_sentences,
                  is_split_into_words=True,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")
print(batch)

### **Training and Fine-tuning**

In [None]:
# Using Trainer

'''
TrainingArguments specifies the training setup:
1. Output directory
2. Total number of epochs
3. Training batch size
4. Evaluation batch size
5. Number of warmup steps
6. weight_decay and the log directory

Then trainer.train() and trainer.evaluate() run training and evaluation.
We can also implement our own model, as long as the first element returned by its forward is the loss.
'''
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

# To report metrics other than the loss, pass a compute_metrics function to Trainer
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# train_dataset and test_dataset are tokenized, labeled datasets that must be prepared beforehand
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset
    compute_metrics=compute_metrics      # extra metrics reported by trainer.evaluate()
)

trainer.train()
trainer.evaluate()
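
For readers who want to see what the binary metrics in `compute_metrics` amount to, here is a dependency-free sketch. `binary_metrics` is a hypothetical helper and the labels and predictions are made up; it treats label 1 as the positive class and returns 0.0 where a ratio is undefined, which is a simplifying assumption.

```python
# Illustrative only: binary_metrics is a hypothetical stand-in for the sklearn calls.
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


labels = [1, 0, 1, 1, 0, 0]  # made-up ground truth
preds = [1, 0, 0, 1, 1, 0]   # made-up predictions
metrics = binary_metrics(labels, preds)
print(metrics)
```

Here there are two true positives, one false positive and one false negative, so precision, recall and F1 all come out to 2/3, while accuracy is 4/6.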