## AutoModelForCausalLM
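[API documentation: AutoModelForCausalLM](https://huggingface.co/transformers/v3.5.1/model_doc/auto.html#automodelforcausallm)

| Approach             | Advantages                                       | When to use                                          |
|----------------------|--------------------------------------------------|------------------------------------------------------|
| `pipeline()`         | Simple and fast; suited to demos and prototyping | Quick output checks that need no fine-grained control |
| Manual `.generate()` | Flexible; exposes every decoding parameter       | Multi-turn chat, chat templates, complex scenarios   |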

## Inference

In [91]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, TextIteratorStreamer
import threading
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # a tokenizer holds no weights, so torch_dtype does not apply
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
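
The printed module tree is enough to work out the model size by hand. The sketch below recomputes the parameter count from the shapes shown above, assuming each `Conv1D` and `LayerNorm` carries a bias term and that `lm_head` shares its weight matrix with `wte` (weight tying), so it adds no parameters of its own. The helper names are illustrative only.

```python
# Shapes copied from the printed GPT2LMHeadModel summary
vocab_size, n_positions, hidden, n_layers = 50257, 1024, 768, 12

def conv1d_params(nx, nf):
    # Conv1D(nf, nx) stores an (nx, nf) weight plus an nf-sized bias
    return nx * nf + nf

def layer_norm_params(dim):
    # elementwise_affine=True: one scale and one shift per feature
    return 2 * dim

embeddings = vocab_size * hidden + n_positions * hidden  # wte + wpe

block = (
    layer_norm_params(hidden)             # ln_1
    + conv1d_params(hidden, 3 * hidden)   # attn.c_attn (fused Q, K, V projection)
    + conv1d_params(hidden, hidden)       # attn.c_proj
    + layer_norm_params(hidden)           # ln_2
    + conv1d_params(hidden, 4 * hidden)   # mlp.c_fc
    + conv1d_params(4 * hidden, hidden)   # mlp.c_proj
)

# lm_head is assumed tied to wte, so it is not counted again
total = embeddings + n_layers * block + layer_norm_params(hidden)
print(f"{total:,}")  # 124,439,808
```

If the model is loaded, `sum(p.numel() for p in model.parameters())` should give a comparable figure, since tied weights are a single shared parameter.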

In [92]:
## Same as AutoModelForCausalLM.from_pretrained(model_name)
# from transformers import GPT2LMHeadModel

# model = GPT2LMHeadModel.from_pretrained("gpt2")

### Regular usage

In [93]:
input_text = "Hello, my name is"

In [94]:
# input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
inputs = tokenizer(
    input_text, 
    return_tensors="pt",
    truncation=True,
    )

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

outputs_decoded = tokenizer.decode(outputs[0], skip_special_tokens=False) # , skip_special_tokens=True

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [96]:
print(outputs_decoded)

Hello, my name is Dave. I'm the creator of the game, and I play the role of the boss. If I have to explain why I do that, I will. My name is Dave. I'm the creator of the game, and I play the role
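
The `temperature`, `top_k`, and `top_p` arguments passed to `generate()` above reshape the next-token distribution before a token is sampled. The following is a simplified pure-Python sketch of that idea, not the library's implementation: it applies temperature, then top-k, then top-p (nucleus) filtering to a toy logit vector. The function name and the five-token vocabulary are made up for illustration.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution
    scaled = [x / temperature for x in logits]
    # Rank token indices from most to least likely
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    # Top-k: keep only the k highest-scoring tokens (0 disables the filter)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving tokens (subtract the peak for numerical stability)
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index, prob in zip(order, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
dist = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
print(dist)
```

With these toy numbers, `top_k=3` removes the two least likely tokens, and lowering `top_p` to 0.9 would drop a third, leaving only the two most likely candidates.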


In [98]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model_name)
result = pipe(input_text, max_new_tokens=200)
print(result[0]["generated_text"])


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, my name is John Drexler. I love being an artist. For someone who doesn't really know how to make music, being an American musician might be the best job I've ever been offered. Of course, it was a bit hard because I had to use a bunch of songs that weren't my own. I made it from scratch, and I've never heard an album like this before. It may be hard to believe, but I'm really lucky.

The first time I did this was with the Bitch, because it was my first time ever doing a solo project. As a kid, my dad didn't do much besides get up every morning, take a few phone calls, and play guitar along the beach. And, as a musician, I was obsessed with the idea of playing music with other musicians. So what I did was write four songs. And I decided that I wanted to write five songs; so, I went all out on this project. I made five, and


### VLM
- [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- [Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)

In [None]:
from transformers import pipeline
from PIL import Image

image_token = "<image>"  # typical LLaVA-style placeholder
# image_token = "<|image|>"  # some Qwen-VL models use this instead
prompt = f"Describe this image. {image_token}"
image = Image.open("../image/example_llama.jpg").convert("RGB")

# Text-only message format, kept for reference (not used below)
text_only_messages = [
    {"role": "user", "content": "Who are you?"}
]

messages = [
     {
         "role": "user",
         "content": [
             {
                 "type": "image",
                #  "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
                 "image": image,
             },
             {"type": "text", "text": f"{prompt}\nAssistant:"},
         ],
     }
 ]

In [86]:
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
import torch
from PIL import Image

image_token = "<image>"  # typical LLaVA-style placeholder
# image_token = "<|image|>"  # some Qwen-VL models use this instead
prompt = f"Describe this image. {image_token}"
image = Image.open("../image/example_llama.jpg").convert("RGB")

# Text-only message format, kept for reference (not used below)
text_only_messages = [
    {"role": "user", "content": "Who are you?"}
]

messages = [
     {
         "role": "user",
         "content": [
             {
                 "type": "image",
                #  "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
                 "image": image,
             },
             {"type": "text", "text": f"{prompt}\nAssistant:"},
         ],
     }
 ]

model_name = "llava-hf/llava-interleave-qwen-0.5b-hf"

processor = AutoProcessor.from_pretrained(model_name)  # a processor holds no weights, so torch_dtype does not apply
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    truncation=True,
    )

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

outputs_decoded = processor.decode(outputs[0], skip_special_tokens=False) # , skip_special_tokens=True

print(outputs_decoded)

Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Describe this image. <image><image><image>[... `<image>` placeholder repeated for the rest of the captured output, which is truncated in the source]

In [87]:
from transformers import pipeline
from PIL import Image

prompt = "Describe this image."
image = Image.open("../image/example_llama.jpg").convert("RGB")

messages = [
     {
         "role": "user",
         "content": [
             {
                 "type": "image",
                #  "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
                 "image": image,
             },
             {"type": "text", "text": f"{prompt}"},
         ],
     }
 ]

pipe = pipeline("image-text-to-text", model=model_name)
outputs = pipe(text=messages, max_new_tokens=60, return_full_text=False)

result = outputs[0]["generated_text"]
print(result)

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


The image is a close-up photograph of a white, fluffy, and fluffy animal, which appears to be a llama. The animal has a large, round, and fluffy face with a wide, open mouth, and its eyes are wide open. The animal's fur is a soft, cream color,


### Stream usage

In [20]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

inputs = tokenizer(
    input_text, 
    return_tensors="pt",
    truncation=True,
    )

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    streamer=streamer,
)

outputs_decoded = tokenizer.decode(outputs[0], skip_special_tokens=False) # , skip_special_tokens=True

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Daniel. I'm a 22 year old guy with a lot of experience in the industry and I'm excited to do this. I'm hoping to be a part of this year's festival and to be able to help out some of the more talented people


In [28]:
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

def generate():
    model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        streamer=streamer,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token, so reuse EOS
        eos_token_id=tokenizer.eos_token_id,
    )

gen_thread = threading.Thread(target=generate)
gen_thread.start()

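# The main thread pulls text chunks from the streamer's queue and prints them
# as they arrive (true token-by-token streaming)
for token in streamer:
    print(token, end="", flush=True)

gen_thread.join()  # wait for the generation thread to finish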

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 John J. Lee. I'm the son of an auto mechanic and my wife, Maria, is a veterinarian. I am a resident of Baltimore County, Maryland. I've been living in Maryland for 12 years, and I'm the CEO and COO of the Baltimore County Medical Association. I've been a resident of Maryland for 11 years and am the director of the Maryland Cancer Center and the Medical Center for Infectious Diseases at St. John's Hospital. I'm a graduate student in the College
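
The threaded pattern above works because the streamer behaves like a queue: the generation thread pushes decoded text into it while the main thread iterates over it. The sketch below mimics that handoff with only the standard library. `ToyIteratorStreamer` and `fake_generate` are hypothetical stand-ins, not the transformers classes, and the word list replaces real model output.

```python
import queue
import threading

class ToyIteratorStreamer:
    """Hypothetical queue-backed streamer illustrating the producer/consumer handoff."""

    _STOP = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        # Called by the producer thread for every new chunk of text
        self._queue.put(text)

    def end(self):
        # Called by the producer thread once generation is finished
        self._queue.put(self._STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer pushes something
        if item is self._STOP:
            raise StopIteration
        return item

def fake_generate(streamer, words):
    # Stand-in for model.generate(..., streamer=streamer)
    for word in words:
        streamer.put(word + " ")
    streamer.end()

toy_streamer = ToyIteratorStreamer()
words = ["Hello,", "my", "name", "is", "GPT-2."]
worker = threading.Thread(target=fake_generate, args=(toy_streamer, words))
worker.start()

received = []
for chunk in toy_streamer:  # consumer: runs in the main thread
    received.append(chunk)
worker.join()
print("".join(received))
```

Because `__next__` blocks on the queue, the consumer loop naturally waits for each chunk, which is why generation has to run in a separate thread.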

### Stream usage with FastAPI

In [22]:
# stream_server.py

import asyncio
import threading

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()

# Load tokenizer & model once at startup
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()

    try:
        while True:
            # 1. Receive prompt from client
            prompt = await websocket.receive_text()

            # 2. Tokenize input
            inputs = tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
            )

            # 3. Prepare streaming generator
            streamer = TextIteratorStreamer(
                tokenizer,
                skip_prompt=True,
                skip_special_tokens=False
            )

            # 4. Run model.generate() in background thread
            def generate():
                model.generate(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                    max_new_tokens=50,
                    do_sample=True,
                    temperature=0.7,
                    top_k=50,
                    top_p=0.95,
                    streamer=streamer,
                    pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
                )

            generation_thread = threading.Thread(target=generate)
            generation_thread.start()

            # 5. Send tokens as they stream in.
            # TextIteratorStreamer is a blocking, synchronous iterator, so it cannot
            # be used with `async for`; fetch each chunk in a worker thread instead
            # to keep the event loop responsive.
            chunks = iter(streamer)
            while True:
                token = await asyncio.to_thread(next, chunks, None)
                if token is None:
                    break
                await websocket.send_text(token)

            generation_thread.join()
            await websocket.send_text("[END]")  # Optional: signal end of generation
    except WebSocketDisconnect:
        print("WebSocket client disconnected")
    except Exception as e:
        print(f"WebSocket connection closed: {e}")
        await websocket.close()


In [None]:
# uvicorn stream_server:app --host 0.0.0.0 --port 8000
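
A queue-backed streamer is a blocking iterator, so a coroutine cannot consume it with `async for` directly. One way to bridge the two, sketched here with only the standard library (Python 3.9+ for `asyncio.to_thread`), is a small async generator that fetches each item in a worker thread. `aiter_blocking` is a hypothetical helper name, and a plain list stands in for the streamer.

```python
import asyncio

async def aiter_blocking(blocking_iterable):
    """Yield items from a blocking iterable without stalling the event loop."""
    iterator = iter(blocking_iterable)
    sentinel = object()
    while True:
        # next() may block, so run it in a worker thread; the sentinel default
        # avoids raising StopIteration across the thread boundary
        item = await asyncio.to_thread(next, iterator, sentinel)
        if item is sentinel:
            return
        yield item

async def collect(chunks):
    # In a WebSocket handler, each chunk would be sent to the client instead
    return [chunk async for chunk in aiter_blocking(chunks)]

collected = asyncio.run(collect(["Hel", "lo", "!"]))
print(collected)
```

The same helper could wrap any blocking iterator in the handler, at the cost of one thread hop per chunk.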

## Training

In [62]:
import os

from dotenv import load_dotenv
import wandb

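# Load the .env file; this populates os.environ, including WANDB_API_KEY if it is defined there
load_dotenv()

# Configure W&B and fail early if the API key is missing
if not os.getenv("WANDB_API_KEY"):
    raise RuntimeError("WANDB_API_KEY is not set; add it to .env or the environment")
os.environ["WANDB_PROJECT"] = "magpie-finetune"
os.environ["WANDB_NAME"] = "bert-base-magpie-run1"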


### Dataset

In [63]:

from datasets import load_dataset
from transformers import AutoTokenizer
from sklearn.preprocessing import LabelEncoder
import torch

# Load the Magpie dataset
dataset = load_dataset("Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B", split="train")

In [64]:
print(dataset.features)

{'conversation_id': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'response': Value(dtype='string', id=None), 'conversations': [{'from': Value(dtype='string', id=None), 'value': Value(dtype='string', id=None)}], 'gen_input_configs': {'temperature': Value(dtype='float64', id=None), 'top_p': Value(dtype='float64', id=None), 'input_generator': Value(dtype='string', id=None), 'seed': Value(dtype='null', id=None), 'pre_query_template': Value(dtype='string', id=None)}, 'gen_response_configs': {'prompt': Value(dtype='string', id=None), 'temperature': Value(dtype='int64', id=None), 'top_p': Value(dtype='float64', id=None), 'repetition_penalty': Value(dtype='float64', id=None), 'max_tokens': Value(dtype='int64', id=None), 'stop_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'output_generator': Value(dtype='string', id=None), 'engine': Value(dtype='string', id=None)}, 'intent': Value(dtype='string', id=None), 'knowledge': Value(dty

In [65]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

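def preprocess_function(examples):
    # Instructions become the encoder inputs, responses the decoder targets
    inputs = examples["instruction"]
    targets = examples["response"]
    model_inputs = tokenizer(inputs, max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=256, truncation=True, padding="max_length")
    # Replace padding ids with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs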

tokenized_dataset = dataset.map(preprocess_function, batched=True)


In [66]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

In [None]:
from transformers import AutoModelForSeq2SeqLM, TrainingArguments, Trainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

training_args = TrainingArguments(
    report_to="wandb",                       # Logging backend (e.g., "tensorboard", "wandb", or "none")
    output_dir="./results",                  # Where to save the model and logs
    evaluation_strategy="epoch",             # Run evaluation every epoch; must match save_strategy when load_best_model_at_end=True (newer releases name this eval_strategy)
    save_strategy="epoch",                   # Save checkpoint every "epoch" (or use "steps")
    learning_rate=2e-5,                      # Initial learning rate for AdamW optimizer
    per_device_train_batch_size=8,           # Batch size per device (GPU) during training
    per_device_eval_batch_size=8,            # Batch size per device during evaluation
    num_train_epochs=3,                      # Total number of training epochs
    weight_decay=0.01,                       # L2 weight decay (regularization)
    logging_dir="./logs",                    # Directory to store logs
    logging_steps=50,                        # Log metrics every N steps
    save_total_limit=2,                      # Limit the total amount of checkpoints saved
    fp16=False,                              # Use mixed precision training (set False if not supported)
    no_cuda=True,                            # Disable CUDA (set False if CUDA is available)
    warmup_steps=500,                        # Number of warmup steps for learning rate scheduler
    gradient_accumulation_steps=1,           # Accumulate gradients over multiple steps before optimizer update
    load_best_model_at_end=True,             # Load the best checkpoint (based on eval loss) at end of training
    metric_for_best_model="loss",            # Which metric to use for determining best model
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset, 
)

trainer.train()


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: jeffhong824 (ai-fast-track-team) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
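
Before committing to a long run like the one interrupted above, it can help to estimate how many optimizer steps the configuration implies. The arithmetic below is a rough sketch using the settings from `TrainingArguments`; the row count is an assumption read off the "250K" in the dataset name, and a single device is assumed, so the real numbers may differ.

```python
import math

rows = 250_000        # assumption: inferred from "250K" in the dataset name
test_percent = 10     # train_test_split(test_size=0.1)
per_device_batch = 8  # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps
epochs = 3            # num_train_epochs
warmup_steps = 500    # warmup_steps
devices = 1           # assumption: single device

eval_rows = math.ceil(rows * test_percent / 100)
train_rows = rows - eval_rows

# One optimizer update per `grad_accum` batches
batches_per_epoch = math.ceil(train_rows / (per_device_batch * devices))
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * epochs

print(f"train rows:        {train_rows:,}")
print(f"updates per epoch: {updates_per_epoch:,}")
print(f"total updates:     {total_updates:,}")
print(f"warmup share:      {warmup_steps / total_updates:.2%}")
```

Under these assumptions the run needs tens of thousands of updates, which on CPU (`no_cuda=True`) is likely to take a very long time; subsampling the dataset first is one way to shorten a trial run.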

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tune with the Trainer API
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./out", num_train_epochs=3),
    train_dataset=your_dataset,
    eval_dataset=your_eval_dataset
)
trainer.train()
