# PaddleNLP

## PaddlePaddle install 

* https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html

## PaddleNLP docs

* https://paddlenlp.readthedocs.io/zh/latest -- documentation
* api: https://paddlenlp.readthedocs.io/en/latest/source/paddlenlp.transformers.tokenizer_utils.html

## source code 

### all supported transformer tokenizers

* paddlenlp/transformers/auto/tokenizer.py
* paddlenlp/transformers/ernie/tokenizer.py  # ERNIE-specific tokenizer

### transformer model configs

* paddlenlp/transformers/ernie/modeling.py

### transformer model zoo and supported languages

* https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer  -- BERT has Portuguese support
* https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/GPT/contents.html -- GPT has Portuguese support

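## PaddleNLP: default download path for datasets and models

* A: Built-in datasets and models are downloaded to $HOME/.paddlenlp/ by default. Set an environment variable to download them to a different path:

    * (1) On Linux, run export PPNLP_HOME="xxxx". Do not use a path that contains Chinese characters.

    * (2) On Windows, likewise set the PPNLP_HOME environment variable to a path without Chinese characters, then restart.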
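
As a minimal sketch of the rule above, the helper below resolves the cache directory the way the note describes: PPNLP_HOME if it is set, otherwise $HOME/.paddlenlp. The name `resolve_ppnlp_home` and the example path are hypothetical and not part of PaddleNLP. It is probably safest to set the variable before importing `paddlenlp`, on the assumption that the library reads it at import time.

```python
import os
from pathlib import Path


def resolve_ppnlp_home(environ=None):
    """Return PPNLP_HOME if set, else the default $HOME/.paddlenlp.

    Hypothetical helper that mirrors the note above; it is not PaddleNLP code.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("PPNLP_HOME")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".paddlenlp"


# Illustrative ASCII-only path; replace it with a real directory.
print(resolve_ppnlp_home({"PPNLP_HOME": "/data/paddlenlp_cache"}))
# With no override, the default under the home directory is used.
print(resolve_ppnlp_home({}))
```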

## Meaning of Transformer return values

* model_outputs.py documents BaseModelOutputWithPoolingAndCrossAttentions in detail
    * https://huggingface.co/transformers/v3.0.2/model_doc/bert.html explains the BertModel outputs
* modeling.py defines ErnieForSequenceClassification, ErnieForQuestionAnswering, ErnieForTokenClassification, ErnieLMPredictionHead, ErnieForMultipleChoice, ErnieForMaskedLM

### Transformers pytorch return

* last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)):
    * Sequence of hidden-states at the output of the last layer of the model.
* pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)):
    * Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.
    * This output is usually not a good summary of the semantic content of the input; averaging or pooling the sequence of hidden-states over the whole input sequence often works better.

* hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True):
    * Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
    * Hidden-states of the model at the output of each layer plus the initial embedding outputs.

* attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True):
    * Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
    * Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
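
The pooling advice under pooler_output can be sketched as masked mean pooling over last_hidden_state. This is a framework-agnostic illustration in NumPy, not code from PaddleNLP or Transformers. The function name `masked_mean_pool` is hypothetical, and the attention mask is assumed to be 1 for real tokens and 0 for padding.

```python
import numpy as np


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence axis, ignoring padded positions."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)        # (batch, seq, hidden)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                            # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                   # avoid division by zero
    return summed / counts


# Toy batch: one sequence of three tokens, the last one is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # (batch_size, hidden_size)
```

With a real model, the same function could be applied to the last_hidden_state tensor after converting it to a NumPy array.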

### ERNIE: a very detailed guide to using the Chinese pre-trained ERNIE model
* https://aistudio.baidu.com/paddle/forum/topic/show/954092

### tokenizer parameter documentation

* https://github.com/PaddlePaddle/PaddleNLP/blob/2000ea2ea88f5ef7804d1ee31f6b57f9feb6f006/paddlenlp/transformers/tokenizer_utils.py#L622-L625

In [50]:
from paddlenlp.transformers import AutoModel, AutoTokenizer
from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer

import paddle

model_name = 'ernie-3.0-medium-zh'
# model_name = "ernie-3.0-xbase-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = ErnieForTokenClassification.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)

# 请问有没有像琉璃神社一样的资源站 有和琉璃神社一样的网站吗？

# inputs = tokenizer("请问有没有像琉璃神社一样的资源站",
#                    "有和琉璃神社一样的网站吗？", max_seq_len=64, pad_to_max_seq_len=True)

query = "请问有没有像琉璃神社一样的资源站"
title = "有和琉璃神社一样的网站吗？"
max_seq_length = 64
pad_to_max = True

inputs = tokenizer(query, title, max_seq_len=max_seq_length, pad_to_max_seq_len=pad_to_max, return_attention_mask=True)

print(inputs)

print(len(inputs['input_ids']))
# inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
# # logits = model(**inputs, return_dict=True, output_hidden_states=True, output_attentions=True)
# logits = model(**inputs)
# print(logits)


[2022-10-16 23:04:31,673] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-medium-zh'.
[2022-10-16 23:04:31,680] [    INFO] - Already cached /mnt/d/dataset/nlp/paddle/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-10-16 23:04:31,736] [    INFO] - tokenizer config file saved in /mnt/d/dataset/nlp/paddle/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-10-16 23:04:31,743] [    INFO] - Special tokens file saved in /mnt/d/dataset/nlp/paddle/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json


{'input_ids': [1, 647, 358, 9, 340, 9, 635, 3016, 1764, 292, 210, 7, 314, 5, 138, 362, 563, 2, 9, 14, 3016, 1764, 292, 210, 7, 314, 5, 305, 563, 1114, 12045, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
64
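
The layout of the printed encoding can be reproduced with a small sketch: [CLS] + query + [SEP] gets segment 0, title + [SEP] gets segment 1, and the rest is padded up to max_seq_len. The helper `build_pair_inputs` is hypothetical and is not the tokenizer's implementation. The special ids (1, 2, 0) are read off the output above, and the attention mask convention (1 for real tokens, 0 for padding) is an assumption, since the printed output does not show one.

```python
def build_pair_inputs(query_ids, title_ids, max_seq_len, cls_id=1, sep_id=2, pad_id=0):
    """Lay out a sentence pair as [CLS] query [SEP] title [SEP] plus padding."""
    first = [cls_id] + list(query_ids) + [sep_id]   # segment 0
    second = list(title_ids) + [sep_id]             # segment 1
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    if len(input_ids) > max_seq_len:
        raise ValueError("pair is longer than max_seq_len; truncation is not sketched here")
    pad = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        # Assumed convention: 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(input_ids) + [0] * pad,
    }


# Token ids copied from the printed output above (special tokens removed).
query_ids = [647, 358, 9, 340, 9, 635, 3016, 1764, 292, 210, 7, 314, 5, 138, 362, 563]
title_ids = [9, 14, 3016, 1764, 292, 210, 7, 314, 5, 305, 563, 1114, 12045]
encoded = build_pair_inputs(query_ids, title_ids, max_seq_len=64)
print(len(encoded["input_ids"]), sum(encoded["attention_mask"]))
```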


In [14]:
print(logits.hidden_states[2])

Tensor(shape=[1, 15, 768], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [[[ 0.06013136,  0.08001949,  0.05949559, ...,  0.03834157,
           0.16428161, -0.02553395],
         [ 0.17710927,  1.07631350,  0.22704059, ..., -0.01483721,
          -0.58641475, -0.37783957],
         [ 0.63684541,  0.03024112,  0.39377603, ...,  0.12226901,
           0.43927503,  1.50784373],
         ...,
         [ 1.05572450,  2.52401543, -0.73081750, ..., -0.19437474,
           0.94409478,  1.50580680],
         [-0.25393343,  0.31314465,  0.06259530, ..., -0.44231522,
          -0.11833040, -0.35460761],
         [ 0.34953618, -1.13955617,  0.45758203, ...,  0.39407942,
           0.17412712, -0.47145003]]])


In [2]:
tokenizer_pt = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')


[2022-08-10 15:14:33,981] [    INFO] - Downloading tokenizer_config.json from https://bj.bcebos.com/paddlenlp/models/community/neuralmind/bert-base-portuguese-cased/tokenizer_config.json
100%|██████████| 167/167 [00:00<00:00, 89.6kB/s]
[2022-08-10 15:14:34,215] [    INFO] - We are using <class 'paddlenlp.transformers.bert.tokenizer.BertTokenizer'> to load 'neuralmind/bert-base-portuguese-cased'.
[2022-08-10 15:14:34,217] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/neuralmind/bert-base-portuguese-cased/vocab.txt and saved to /home/jeffye/.paddlenlp/models/neuralmind/bert-base-portuguese-cased
[2022-08-10 15:14:34,218] [    INFO] - Downloading vocab.txt from https://bj.bcebos.com/paddlenlp/models/community/neuralmind/bert-base-portuguese-cased/vocab.txt
100%|██████████| 205k/205k [00:00<00:00, 956kB/s]
[2022-08-10 15:14:34,699] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/commun

In [None]:
# pretrained_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")
# del pretrained_model
# pretrained_model = AutoModel.from_pretrained("pierreguillou/bert-base-cased-squad-v1.1-portuguese")
# del pretrained_model
pretrained_model = AutoModel.from_pretrained("bert-base-multilingual-uncased")


### tokenizer api
* docs: https://paddlenlp.readthedocs.io/en/latest/source/paddlenlp.transformers.ernie.tokenizer.html
* huggingface: https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer

In [24]:
from paddlenlp.transformers import AutoModel, AutoTokenizer
from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer

import paddle
# from paddlenlp.transformers import ErnieTinyTokenizer
# tokenizer = ErnieTinyTokenizer.from_pretrained('ernie-tiny')

model_name = 'ernie-3.0-medium-zh'
# model_name = "ernie-3.0-xbase-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)

[2023-05-07 21:31:14,609] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-medium-zh'.
[2023-05-07 21:31:14,610] [    INFO] - Already cached C:\Users\73915\.paddlenlp\models\ernie-3.0-medium-zh\ernie_3.0_medium_zh_vocab.txt
[2023-05-07 21:31:14,644] [    INFO] - tokenizer config file saved in C:\Users\73915\.paddlenlp\models\ernie-3.0-medium-zh\tokenizer_config.json
[2023-05-07 21:31:14,646] [    INFO] - Special tokens file saved in C:\Users\73915\.paddlenlp\models\ernie-3.0-medium-zh\special_tokens_map.json


In [30]:
vocabs = tokenizer.get_vocab()
# split the text into tokens
tokens = tokenizer.tokenize("this is我所要的")

# tokenizer.get_vocab()[token] is equivalent to tokenizer.convert_tokens_to_ids(token) when token is in the vocab.
strings = tokenizer.convert_tokens_to_ids(tokens)
token_ids_vocab2index = [vocabs[tok] for tok in tokens]
# text to ids
token_ids = tokenizer("this is我所要的")

# id to token
id2tokens = tokenizer.convert_ids_to_tokens(token_ids['input_ids'])

print('tokenized:', tokens)
print('convert_tokens_to_ids:', strings)
print('token to index', token_ids_vocab2index)
print('token ids:', token_ids, '[CLS] (id 1) is inserted at the start, [SEP] (id 2) is appended at the end')
print('id2token:', id2tokens)
print('[MASK] id=', tokenizer.convert_tokens_to_ids(['[MASK]']))
print('id=0, token=', tokenizer.convert_ids_to_tokens([0]))


tokenized: ['this', 'is', '我', '所', '要', '的']
convert_tokens_to_ids: [3730, 2775, 75, 110, 41, 5]
token to index [3730, 2775, 75, 110, 41, 5]
token ids: {'input_ids': [1, 3730, 2775, 75, 110, 41, 5, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]} [CLS] (id 1) is inserted at the start, [SEP] (id 2) is appended at the end
id2token: ['[CLS]', 'this', 'is', '我', '所', '要', '的', '[SEP]']
[MASK] id= [3]
id=0, token= ['[PAD]']
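
The relationship between these calls can be sketched with a toy vocabulary built only from the ids printed above. This is not the real ErnieTokenizer, whose vocabulary is far larger and which also handles unknown tokens. The helper names `tokens_to_ids`, `ids_to_tokens` and `encode` are hypothetical.

```python
# Toy vocabulary copied from the printed ids above (the real vocab is much larger).
vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
    "this": 3730, "is": 2775, "我": 75, "所": 110, "要": 41, "的": 5,
}
id_to_token = {idx: tok for tok, idx in vocab.items()}


def tokens_to_ids(tokens):
    """Look up each token in the vocabulary (no unknown-token handling here)."""
    return [vocab[tok] for tok in tokens]


def ids_to_tokens(ids):
    """Inverse lookup: ids back to tokens."""
    return [id_to_token[idx] for idx in ids]


def encode(tokens):
    """Wrap the token ids with [CLS] at the start and [SEP] at the end."""
    return [vocab["[CLS]"]] + tokens_to_ids(tokens) + [vocab["[SEP]"]]


tokens = ["this", "is", "我", "所", "要", "的"]
input_ids = encode(tokens)
print(input_ids)
print(ids_to_tokens(input_ids))
```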


### Transformer (Ernie) modeling notes
* Source location: paddlenlp\transformers\ernie\modeling.py
* See ModelOutput for the return format of each model type: paddlenlp\transformers\model_outputs.py

## LAC


In [1]:
from paddlenlp import Taskflow

lac = Taskflow("lexical_analysis")
lac("LAC是个优秀的分词工具")


[{'text': 'LAC是个优秀的分词工具',
  'segs': ['LAC', '是', '个', '优秀', '的', '分词', '工具'],
  'tags': ['nz', 'v', 'q', 'a', 'u', 'n', 'n']}]
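
A result in this shape is often easier to use once each segment is paired with its tag. The sketch below works on the literal output shown above and does not call Taskflow. The helper name `pair_words_with_tags` is hypothetical, and filtering on the tag 'n' is only an illustration of selecting one tag.

```python
# The result dict is copied from the output above.
result = [{
    "text": "LAC是个优秀的分词工具",
    "segs": ["LAC", "是", "个", "优秀", "的", "分词", "工具"],
    "tags": ["nz", "v", "q", "a", "u", "n", "n"],
}]


def pair_words_with_tags(item):
    """Zip the parallel 'segs' and 'tags' lists into (word, tag) pairs."""
    return list(zip(item["segs"], item["tags"]))


pairs = pair_words_with_tags(result[0])
# Keep only the words whose tag is exactly 'n'.
words_tagged_n = [word for word, tag in pairs if tag == "n"]
print(pairs)
print(words_tagged_n)
```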