<a href="https://colab.research.google.com/github/hyakuroume/Generative_AI/blob/develop/transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Tokenizing text

In [14]:
from transformers import AutoTokenizer

In [15]:
# Run the prompt through the Qwen tokenizer to convert it into a sequence of token IDs
prompt = "It was a dark and stormy"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer(prompt).input_ids
input_ids

[2132, 572, 264, 6319, 323, 13458, 88]

In [None]:
for t in input_ids:
    print(t, "\t:", tokenizer.decode(t))

2132 	: It
572 	:  was
264 	:  a
6319 	:  dark
323 	:  and
13458 	:  storm
88 	: y
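
The decoded pieces above show that most tokens carry their own leading space, which is why joining them with no separator restores the prompt. The sketch below replays this with a plain dictionary copied from that output; `pieces` and `ids` are illustrative names, and the dictionary is only a stand-in for the tokenizer's vocabulary, not a call into the Qwen tokenizer.

```python
# Token pieces copied from the decoded output above (ID -> text).
pieces = {
    2132: "It",
    572: " was",
    264: " a",
    6319: " dark",
    323: " and",
    13458: " storm",
    88: "y",
}
ids = [2132, 572, 264, 6319, 323, 13458, 88]

# Joining the pieces without a separator reproduces the prompt,
# because the leading spaces are part of the tokens themselves.
reconstructed = "".join(pieces[i] for i in ids)
print(reconstructed)
```

Note that "stormy" is split into two tokens (` storm` and `y`), so the number of tokens need not match the number of words.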


## Predicting probabilities

In [13]:
# Load the causal language model (Qwen2-0.5B)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")


In [None]:
# Tokenize again, asking for the result as PyTorch tensors
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

In [None]:
outputs = model(input_ids)
outputs.logits.shape

torch.Size([1, 7, 151936])

In [None]:
# ID of the most likely next token (argmax over the logits at the last position)
final_logits = model(input_ids).logits[0, -1]
final_logits.argmax()

tensor(3729)

In [None]:
# Decode the token ID back into text
tokenizer.decode(final_logits.argmax())

' night'

In [None]:
# Show the other candidate tokens
import torch

top10_logits = torch.topk(final_logits, 10)

for index in top10_logits.indices:
    print(tokenizer.decode(index))

 night
 evening
 day
 morning
 winter
 afternoon
 Saturday
 Sunday
 Friday
 October


In [None]:
# Probability of each candidate token
top10 = torch.topk(final_logits.softmax(dim=0), 10)
for value, index in zip(top10.values, top10.indices):
    print(f"{tokenizer.decode(index):<12}{value.item():.2%}")

 night      88.71%
 evening    4.30%
 day        2.19%
 morning    0.49%
 winter     0.45%
 afternoon  0.27%
 Saturday   0.25%
 Sunday     0.19%
 Friday     0.17%
 October    0.16%
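
The percentages above come from applying softmax to the logits at the last position. As a minimal sketch of that step in plain Python, the function below turns raw scores into probabilities; the `toy_logits` values are made up for illustration and are not the model's actual logits.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (not taken from the model).
toy_logits = {" night": 5.0, " evening": 2.0, " day": 1.0}
probs = softmax(list(toy_logits.values()))
for token, p in zip(toy_logits, probs):
    print(f"{token:<12}{p:.2%}")
```

Because softmax is monotonic, the ranking by probability is the same as the ranking by logit, which is why `argmax` on the logits already identifies the most likely token.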


## Generating text

In [None]:
# Greedy decoding
output_ids = model.generate(input_ids, max_new_tokens=20)
decode_text = tokenizer.decode(output_ids[0])

print("Input IDs", input_ids[0])
print("Output IDs", output_ids)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Input IDs tensor([ 2132,   572,   264,  6319,   323, 13458,    88])
Output IDs tensor([[ 2132,   572,   264,  6319,   323, 13458,    88,  3729,    13,   576,
         12884,   572,  6319,   323,   279,  9956,   572,  1246,  2718,    13,
           576, 11174,   572, 50413,  1495,   323,   279]])


In [None]:
# 生成されたテキストの表示
print(f"Generated text: {decode_text}")

Generated text: It was a dark and stormy night. The sky was dark and the wind was howling. The rain was pouring down and the
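
Greedy decoding, as the cell above is labeled, appends the single highest-scoring token at every step. The loop below is a rough sketch of that idea over a hypothetical lookup table; `toy_model` and its scores are invented for illustration and stand in for a real forward pass.

```python
# Hypothetical next-token scores: current token -> {candidate: score}.
toy_model = {
    "stormy": {"night": 0.9, "evening": 0.1},
    "night": {".": 0.7, "in": 0.3},
    ".": {"The": 0.6, "It": 0.4},
}

def greedy_decode(start, max_new_tokens):
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = toy_model.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        # Greedy step: always take the single highest-scoring token.
        tokens.append(max(candidates, key=candidates.get))
    return tokens

print(greedy_decode("stormy", max_new_tokens=3))
```

Since every step is deterministic, running the loop twice with the same start token yields the same sequence.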


In [None]:
# Beam search
beam_output = model.generate(
    input_ids,
    num_beams=5,
    max_new_tokens=30,
)

print(tokenizer.decode(beam_output[0]))


It was a dark and stormy night. The wind was howling, and the rain was pouring down. The sky was dark and gloomy, and the air was filled with the
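
Beam search keeps several partial sequences alive and ranks them by cumulative log-probability, so it can find a sequence that a greedy first step would miss. The following is a simplified sketch under stated assumptions: `toy_model` is a made-up probability table, and the function omits details such as length penalties and end-of-sequence handling that a full implementation would need.

```python
import math

# Hypothetical conditional probabilities: last token -> {next token: prob}.
toy_model = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def beam_search(start, steps, num_beams):
    # Each beam is (cumulative log-probability, token list).
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for tok, p in toy_model.get(tokens[-1], {}).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        if not candidates:  # nothing left to expand
            break
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

best_score, best_tokens = beam_search("<s>", steps=2, num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))
```

In this toy table the greedy choice after `<s>` is `a`, yet the most probable two-token continuation starts with `b`; with `num_beams=1` the function reduces to greedy decoding and misses it.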


In [None]:
# Penalize repetition
beam_output = model.generate(
    input_ids,
    num_beams=5,
    repetition_penalty=1.2,
    max_new_tokens=38,
)

print(tokenizer.decode(beam_output[0]))


It was a dark and stormy night. The wind was howling, and the rain was pouring down. The sky was dark and gloomy, and the air was filled with the sound of thunder and lightning. Suddenly,
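
A repetition penalty lowers the scores of tokens that have already been generated before the next token is chosen. The sketch below shows one common formulation (divide positive logits by the penalty, multiply negative ones), which I believe matches what `repetition_penalty` does but should be read as an assumption; `apply_repetition_penalty` and `toy_logits` are illustrative names and values.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

# Hypothetical logits over a 4-token vocabulary; tokens 0 and 2 were already generated.
toy_logits = [3.0, 1.0, -2.0, 0.5]
print(apply_repetition_penalty(toy_logits, generated_ids=[0, 2, 0], penalty=1.2))
```

A penalty of 1.0 leaves the logits unchanged, and values above 1.0 discourage repeats more strongly.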


In [None]:
# Sampling
from transformers import set_seed

set_seed(70)

sampling_output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=34,
    top_k=0
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy night.Six pots of moisture laced the sky，pulling back fog and issuing a shared smell.
16 km away they were cleaned and brought to her attention by


In [None]:
# temperature:0.4
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.4,
    max_new_tokens=40,
    top_k=0
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy night in the house, when we were all gathered around the fire. We were all looking forward to the night we were going to be together. I was sitting on the couch, watching the movie,


In [None]:
# temperature:0.001
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.001,
    max_new_tokens=40,
    top_k=0
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy night. The sky was dark and the wind was howling. The rain was pouring down and the lightning was flashing. The sky was dark and the wind was howling. The rain was pouring down


In [None]:
# temperature:3.0
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    temperature=3.0,
    max_new_tokens=40,
    top_k=0
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy wheahkan exhilar swords seasHe Bd HibernateOthers Турية freed deploy Exhibition strtotimering finishing invadingmarker honoringЩ Uniform barracks Joan onde abbrev Mg/get铟 railway sticking Ant municipalities Kgforeach covering kin grown tacticalButtonText
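
The three temperature runs above differ only in how the logits are scaled before softmax: dividing by a small temperature sharpens the distribution (at 0.001 the output matches the greedy text), while a large one flattens it until unlikely tokens are sampled. A minimal sketch of that scaling, using made-up `toy_logits`:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [4.0, 2.0, 1.0]  # hypothetical scores, not from the model
for t in (0.4, 1.0, 3.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as the temperature grows, which leaves more room for the other candidates.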


In [None]:
# Top-k sampling
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=40,
    top_k=5,
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy night in the small town of St. Marys.
I was driving home after work and was driving slowly on a winding highway, when a large truck pulled out of a parking spot in front of me
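
Top-k sampling restricts each draw to the `k` most probable tokens, which cuts off the long tail that produced the gibberish at high temperature. The sketch below illustrates the filtering step with a hypothetical distribution; `top_k_sample` and `toy_probs` are illustrative and not part of the `transformers` API.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample a token, considering only the k most probable candidates."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]  # random.choices accepts unnormalized weights
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (values invented for illustration).
toy_probs = {" night": 0.6, " evening": 0.2, " day": 0.1, " winter": 0.06, " October": 0.04}
rng = random.Random(0)
samples = {top_k_sample(toy_probs, k=2, rng=rng) for _ in range(200)}
print(samples)
```

With `k=2`, only ` night` and ` evening` can ever be drawn, however many samples are taken.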


In [None]:
# Top-p (nucleus) sampling
sampling_output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=40,
    top_p=0.94,
    top_k=0,
)

print(tokenizer.decode(sampling_output[0]))


It was a dark and stormy night in Mississippi. Lacks left 18 people. He would not return. He felt alone and miserable. He stepped forward to the front door and opened it, but 12 inches of
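
Top-p sampling keeps the smallest set of top-ranked tokens whose cumulative probability reaches `top_p`, so the number of candidates adapts to how confident the model is. A rough sketch of the filtering step, with `top_p_filter` and `toy_probs` as illustrative names and values:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: p / total for token, p in kept.items()}

# Hypothetical next-token distribution (values invented for illustration).
toy_probs = {" night": 0.6, " evening": 0.2, " day": 0.1, " winter": 0.06, " October": 0.04}
print(top_p_filter(toy_probs, top_p=0.85))
```

A peaked distribution keeps only a few tokens, while a flat one keeps many; sampling then proceeds from the renormalized subset.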


## Zero-shot generalization

In [25]:
# Check the token IDs of the words " positive" and " negative"
tokenizer.encode(" positive", add_special_tokens=False), tokenizer.encode(" negative", add_special_tokens=False)

([6785], [8225])

In [43]:
def cinema_review_score(cinema_review: str) -> None:
    """
    Predict whether a movie review is positive or negative.

    Args:
        cinema_review: The text of the movie review.

    Returns:
        None. The predicted label is printed.
    """

    prompt = f"""Question: Is the following review positive or negative about the movie?
Review: {cinema_review} Answer:"""

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Get the logits over the vocabulary for the next token
    final_logits = model(input_ids).logits[0, -1]

    # Look up the single token ID for " positive" and for " negative"
    pos_id = tokenizer.encode(" positive", add_special_tokens=False)[0]
    neg_id = tokenizer.encode(" negative", add_special_tokens=False)[0]

    # Check whether the positive token's logit is higher than the negative token's
    if final_logits[pos_id] > final_logits[neg_id]:
        print("Positive")
    else:
        print("Negative")

In [44]:
cinema_review_score("This movie was terrible")

Negative


In [45]:
cinema_review_score("That movie was great!")

Positive


In [46]:
cinema_review_score("A complex yet wonderful film about the depravity of man")

Negative


## Few-shot generalization

In [56]:
prompt = """\
Translate English to Spanish:

English: I do not speak Spanish.
Spanish: No hablo español.

English: See you later!
Spanish: ¡Hasta luego!

English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?

English: I like soccer.
Spanish:"""

In [57]:
inputs = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(
    inputs,
    max_new_tokens=10,
)



In [58]:
print(tokenizer.decode(output[0]))

Translate English to Spanish:

English: I do not speak Spanish.
Spanish: No hablo español.

English: See you later!
Spanish: ¡Hasta luego!

English: Where is a good restaurant?
Spanish: ¿Dónde hay un buen restaurante?

English: I like soccer.
Spanish: Me gusta el fútbol.

English:
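
The few-shot prompt above follows a fixed pattern: an instruction line, several solved pairs, then the unanswered query. As a small sketch of how such a prompt could be assembled programmatically, the helper below rebuilds the same text; `build_few_shot_prompt` is a hypothetical helper, not something provided by `transformers`.

```python
def build_few_shot_prompt(examples, query):
    """Format (English, Spanish) pairs followed by the unanswered query."""
    lines = ["Translate English to Spanish:", ""]
    for english, spanish in examples:
        lines += [f"English: {english}", f"Spanish: {spanish}", ""]
    lines += [f"English: {query}", "Spanish:"]
    return "\n".join(lines)

examples = [
    ("I do not speak Spanish.", "No hablo español."),
    ("See you later!", "¡Hasta luego!"),
    ("Where is a good restaurant?", "¿Dónde hay un buen restaurante?"),
]
few_shot_prompt = build_few_shot_prompt(examples, "I like soccer.")
print(few_shot_prompt)
```

The generated output above continues past the answer with a new `English:` line, so a caller would typically keep only the text up to the first blank line after the final `Spanish:`.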


## Encoder-only models

In [59]:
from transformers import pipeline

fill_masker = pipeline("fill-mask", model="bert-base-uncased")
fill_masker("The [MASK] is made of milk.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'score': 0.19546695053577423,
  'token': 9841,
  'token_str': 'dish',
  'sequence': 'the dish is made of milk.'},
 {'score': 0.1290755718946457,
  'token': 8808,
  'token_str': 'cheese',
  'sequence': 'the cheese is made of milk.'},
 {'score': 0.10590687394142151,
  'token': 6501,
  'token_str': 'milk',
  'sequence': 'the milk is made of milk.'},
 {'score': 0.041120849549770355,
  'token': 4392,
  'token_str': 'drink',
  'sequence': 'the drink is made of milk.'},
 {'score': 0.03712356090545654,
  'token': 7852,
  'token_str': 'bread',
  'sequence': 'the bread is made of milk.'}]

## BERT-based text classification model

In [61]:
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)

classifier("This movie is disgustingly good!")


[{'label': 'POSITIVE', 'score': 0.9998536109924316}]

## Bias inherent in the model

In [63]:
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK] during summer.")  # male subject
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK] during summer.")  # female subject
print([r["token_str"] for r in result])


['farmer', 'carpenter', 'gardener', 'fisherman', 'miner']
['maid', 'nurse', 'servant', 'waitress', 'cook']
