<a href="https://colab.research.google.com/github/h0806449f/PyTorch/blob/main/TRY_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

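The two main themes that NLP models revolve around
1. Contextual semantics -> judging whether several word vectors are similar
2. Topical semantics -> judging several word vectors
3. BERT: 1. + 2.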
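
One common way to judge whether two word vectors are similar is cosine similarity. The sketch below is only an illustration of that idea: the three-dimensional vectors and the names `king`, `queen` and `banana` are made-up toy values, not embeddings taken from any real model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "word vectors" (made-up values, not from a real model)
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.2, 0.95]

sim_close = cosine_similarity(king, queen)
sim_far = cosine_similarity(king, banana)
print(f"king vs queen:  {sim_close:.3f}")
print(f"king vs banana: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0.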

# **== 0. Introduction: what Transformers can do ==**
from HuggingFace

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

from transformers import pipeline

In [None]:
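# Sentiment analysis
classifier = pipeline(model = "distilbert-base-uncased-finetuned-sst-2-english", # Default model
                      task = "sentiment-analysis")


# The input is intentionally Chinese ("First attempt at using an NLP model; the model comes from HuggingFace and looks rather impressive").
# This checkpoint was fine-tuned on English text only, so the label it returns for Chinese input is not reliable.
classifier("首次嘗試使用NLP相關模型, 模型來自於HuggingFace, 看起來有點厲害")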

[{'label': 'NEGATIVE', 'score': 0.970554769039154}]

In [None]:
# Zero-shot text classification
classifier = pipeline(model = "facebook/bart-large-mnli", # Default model
                      task = "zero-shot-classification")

classifier("This is a course about the Transformers library",
           candidate_labels=["education", "politics", "business"])

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}

In [None]:
# Text generation 1
generator = pipeline(model = "gpt2", # Default model
                     task = "text-generation")

generator("Today is monday",
          max_new_tokens = 150)

In [None]:
# Text generation 2
generator = pipeline("text-generation", model="distilgpt2")

generator(
    "",
    max_length=50,
    num_return_sequences=3,
)

# **== 1. Transformer ==**

## 1.1 Pipeline

In [None]:
from transformers import pipeline

classifier = pipeline(model = "distilbert-base-uncased-finetuned-sst-2-english",
                      task = "sentiment-analysis")

classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### 1.1.1 Tokenizer

In [None]:
# Tokenize
from transformers import AutoTokenizer

# Use a pretrained checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# Check the tokenizer output

raw_inputs = ["I've been waiting for a HuggingFace course my whole life.",
              "I hate this so much!",]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") # returns a dict-like object

print(inputs["input_ids"])
print(inputs["attention_mask"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])


### 1.1.2 Through pretrained model

In [None]:
# Model
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [None]:
# Model's output
outputs = model(**inputs)

outputs.logits

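# Row 1: first sentence  -> [NEGATIVE logit, POSITIVE logit]
# Row 2: second sentence -> [NEGATIVE logit, POSITIVE logit]
# These are raw logits, not probabilities; softmax is still needed to turn them into probabilities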

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

### 1.1.3 Logits -> meaningful answers

In [None]:
import torch

# Label mapping (id -> sentiment name)
class_names = model.config.id2label

# logits -> probs -> label_index
probability = torch.softmax(outputs.logits, dim = 1)
label = torch.argmax(probability, dim = 1)

# First sentence
print(f"Sentiment of the first sentence: {class_names[label[0].item()]}")
# Second sentence
print(f"Sentiment of the second sentence: {class_names[label[1].item()]}")

Sentiment of the first sentence: POSITIVE
Sentiment of the second sentence: NEGATIVE
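
To make the logits -> probabilities -> label step concrete, here is a minimal dependency-free sketch that applies softmax by hand to the logits printed in section 1.1.2. The `id2label` dictionary is written out as an assumption inferred from the outputs above (the notebook itself reads it from `model.config.id2label`), and `softmax` is a hypothetical helper, not a library function.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits printed in section 1.1.2, one row per sentence
logits = [[-1.5607, 1.6123],
          [4.1692, -3.3464]]

# Assumed mapping, inferred from the outputs above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((id2label[best], probs[best]))
    print(id2label[best], round(probs[best], 4))
```

The resulting probabilities should land very close to the scores the pipeline reported in section 1.1, which is a useful sanity check that the pipeline is doing tokenizer -> model -> softmax under the hood.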


## 1.2 Model

### 1.2.1 Get pretrained model

In [None]:
from transformers import BertModel

# Use the checkpoint published by the model's authors
model = BertModel.from_pretrained("bert-base-cased")
# [INFO] -> to customize the model, its parameters have to be configured

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### 1.2.2 Save model

In [None]:
model.save_pretrained("Model_of_Bert")

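# Two files are saved in the specified directory:
# 1. config.json        model attributes (architecture and hyperparameters)
# 2. pytorch_model.bin  model weights (recent transformers releases write model.safetensors by default instead)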

## 1.3 Tokenizer
Sentences -> numbers
* Word-based
* Character-based (less useful for English, where a whole word usually carries the meaning / more useful for Chinese)
* Save tokenizer

### 1.3.1 Word-based

In [None]:
text = "What is we have seven days for weekend?"

tokenized_text = text.split()
tokenized_text

# 0 -> What
# 1 -> is
# ...
# 8 -> unknown

['What', 'is', 'we', 'have', 'seven', 'days', 'for', 'weekend?']
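
The id assignment hinted at in the comments above (0 -> What, 1 -> is, ..., 8 -> unknown) can be sketched as follows. This is a toy word-based vocabulary built from the single example sentence; `vocab`, `UNK_ID` and `encode` are hypothetical names introduced only for illustration.

```python
text = "What is we have seven days for weekend?"
tokens = text.split()

# Assign ids in order of first appearance
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# The id right after the last known word is reserved for unknown words (8 here)
UNK_ID = len(vocab)

def encode(sentence):
    """Map each whitespace-separated word to its id, or UNK_ID if unseen."""
    return [vocab.get(tok, UNK_ID) for tok in sentence.split()]

print(encode("What is seven"))
print(encode("What is Monday"))
```

Any word missing from the vocabulary collapses to the same unknown id, which is the main weakness of word-based tokenization and one reason pretrained tokenizers split text into subwords instead.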

### 1.3.2 Pretrained tokenizer

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# # The following has the same effect
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

### 1.3.3 Save tokenizer

In [None]:
tokenizer.save_pretrained("Tokenizer_of_Bert")

('Tokenizer_of_Bert/tokenizer_config.json',
 'Tokenizer_of_Bert/special_tokens_map.json',
 'Tokenizer_of_Bert/vocab.txt',
 'Tokenizer_of_Bert/added_tokens.json')

### 1.3.4 Decode

In [None]:
text = "Today is Sunday"
print(f"Original text: {text}")

token = tokenizer(text)
print(f"Encode text: {token['input_ids']}")

untoken = tokenizer.decode(token['input_ids'])
print(f"Decode token: {untoken}")

Original text: Today is Sunday
Encode text: [101, 3570, 1110, 3625, 102]
Decode token: [CLS] Today is Sunday [SEP]


## 1.4 How the tokenizer handles multiple sequences

### 1.4.1 Mind the size / shape

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The Auto classes look up the tokenizer and the sequence-classification model that match the checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)                       # returns a dict-like object when called
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

token = tokenizer(sequence, return_tensors="pt")
input_ids = token["input_ids"]                                              # already shaped (batch_size=1, seq_len): return_tensors="pt" adds the batch dimension

model(input_ids)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

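### 1.4.2 Padding the inputs
1. Padding -> when the input contains several sentences, shorter ones are filled up to the length of the longest
2. Attention mask -> marks the padded positions so the model ignores them and the meaning of the short sentence is not affected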
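
The two steps above can be sketched without any library. The helper `pad_batch` below is a hypothetical illustration, not the tokenizer's actual implementation; the token ids are copied from the tokenizer output in section 1.1.1, and `pad_id=0` is assumed because the padded positions in that output are 0.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from section 1.1.1
long_sentence = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
                 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
short_sentence = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_batch([long_sentence, short_sentence])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Both rows end up with the same length, so they can be stacked into one tensor, while the mask records which positions are padding.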

## 1.5 Put together

In [None]:
# Tokenizer
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = ["I've been waiting for a HuggingFace course my whole life.", "Nice to meet you"]

# With padding, the tokenizer also returns an attention mask that marks the padded positions
padding = tokenizer(sequence, padding = "longest", return_tensors="pt")
print(padding)

# Truncate every sentence to a fixed maximum length
truncated = tokenizer(sequence, truncation = True, max_length = 4, return_tensors="pt")
print(truncated)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  3835,  2000,  3113,  2017,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
{'input_ids': tensor([[ 101, 1045, 1005,  102],
        [ 101, 3835, 2000,  102]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}


In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**tokens)

In [None]:
# Label mapping (id -> sentiment name)
class_names = model.config.id2label

# logits -> probs -> label_index
# Uses the outputs computed in the previous cell for these two sentences
probability = torch.softmax(outputs.logits, dim = 1)
label = torch.argmax(probability, dim = 1)

# First sentence
print(f"Sentiment of the first sentence: {class_names[label[0].item()]}")
# Second sentence
print(f"Sentiment of the second sentence: {class_names[label[1].item()]}")


# **== 2. Fine-tuning a pretrained model ==**

In [None]:
# Dataset
from datasets import load_dataset

raw_dataset = load_dataset("glue", "mrpc")
raw_dataset

In [None]:
# Inspect the features of a single split
raw_dataset["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
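# Tokenizer, method one
# Why this approach is less ideal:
# 1) It returns a dictionary-like object rather than a Dataset. 2) It uses more memory, so RAM can run out on large datasets. 3) With padding = True every sample is padded to the same length, which wastes memory. 4) Only the "train" split is converted so far
from transformers import AutoTokenizer

# Build the tokenizer from the checkpoint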
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenized_train_dataset = tokenizer(
    raw_dataset["train"]["sentence1"],
    raw_dataset["train"]["sentence2"],
    padding = True
    # truncation = True
)

In [None]:
# Tokenizer, method two
from transformers import AutoTokenizer

# Build the tokenizer from the checkpoint
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# func
# Tokenize sentence1 and sentence2 together (set up this way because mrpc is a dataset of sentence pairs)
# With batched = True, single_dataset is a batch of examples taken from one split
def tokenization_function(single_dataset):
    return tokenizer(single_dataset["sentence1"], single_dataset["sentence2"], truncation = True)

# map applies our function to every split (train / validation / test) in the dataset
tokenized_dataset = raw_dataset.map(tokenization_function, batched = True)

In [None]:
# Check dataset info.
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
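
Method two tokenizes with truncation only, so each row of `input_ids` keeps its own length. A rough, library-free sketch of what padding per batch looks like is shown below. The function `collate_with_dynamic_padding` and the sample ids are hypothetical and exist only for illustration; in practice the transformers library offers `DataCollatorWithPadding` for this purpose.

```python
def collate_with_dynamic_padding(examples, pad_id=0):
    """Pad every sequence only up to the longest one in this batch."""
    max_len = max(len(ex) for ex in examples)
    return [ex + [pad_id] * (max_len - len(ex)) for ex in examples]

# Hypothetical tokenized samples of different lengths (made-up ids)
samples = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 7, 8, 102],
    [101, 7, 8, 9, 10, 11, 102],
]

batch_size = 2
batches = [
    collate_with_dynamic_padding(samples[i:i + batch_size])
    for i in range(0, len(samples), batch_size)
]

for batch in batches:
    print(len(batch[0]), batch)
```

Each batch is padded to its own maximum length rather than to the longest sequence in the whole dataset, which is where the memory saving over method one comes from.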

In [None]:
# Dynamic padding -> method two above did not pad everything in one go