## Loading the GLM-4 model

## 1 Loading with transformers

In [2]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Expose only physical GPU 1 to this process (must be set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
MODEL_PATH = '/opt/Data/ModelWeight/THUDM/glm-4-9b-chat'
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# With top_k=1 only the highest-scoring token can be sampled, so decoding is effectively greedy
gen_kwargs = {"max_length": 8192, "do_sample": True, "top_k": 1}

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]
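
The `gen_kwargs` above combine `do_sample=True` with `top_k=1`. A minimal sketch of why that behaves like greedy decoding, using a toy top-k sampler written in plain Python (the `sample_top_k` helper and the logits are hypothetical illustrations, not the transformers implementation):

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample a token index after keeping only the k largest logits."""
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 2.5, -1.0, 2.4]  # toy scores for a 4-token vocabulary
draws = {sample_top_k(logits, 1, rng) for _ in range(100)}
print(draws)  # {1}
```

With `k=1` only the arg-max token survives the filter, so every draw returns it regardless of the random seed.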

## Text generation

In [6]:
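# The query asks, in Chinese, for a roughly 500-character introduction to Nanjing's wutong (plane) trees
query = "介绍一下南京市的梧桐树，字数500"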
inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = inputs.to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))


南京市，这座历史悠久的城市，以其深厚的文化底蕴和独特的城市风貌而闻名。在这座城市中，有一种树木尤为引人注目，那就是被誉为“城市之肺”的梧桐树。

梧桐树，学名法国梧桐，原产于欧洲，后传入我国。在南京，梧桐树有着悠久的历史，早在民国时期，南京就大量种植梧桐树，成为城市绿化的重要组成部分。如今，南京的梧桐树已经遍布大街小巷，成为这座城市的一道亮丽风景线。

南京的梧桐树具有以下特点：

1. 树形优美：梧桐树树冠宽广，枝叶繁茂，树干挺拔，给人一种庄重、大气的感觉。在南京的街头巷尾，梧桐树犹如一把把绿色的巨伞，为行人提供遮阳避雨的便利。

2. 生长迅速：梧桐树生长速度快，适应性强，耐寒耐旱，是城市绿化的理想树种。在南京，许多梧桐树已经长成参天大树，为城市增添了一抹生机。

3. 环境效益显著：梧桐树具有净化空气、降低噪音、调节气候等环境效益。在南京，梧桐树为市民提供了良好的生态环境，成为城市绿化的典范。

4. 文化内涵丰富：梧桐树在南京有着丰富的文化内涵。相传，梧桐树是凤凰栖息之地，象征着吉祥、美好。在南京，许多历史名人、文人墨客都与梧桐树有着不解之缘，留下了许多脍炙人口的诗篇。

5. 历史见证者：南京的梧桐树见证了这座城市的历史变迁。从民国时期到新中国成立，从改革开放到现代化建设，梧桐树始终陪伴着南京这座城市，成为历史的见证者。

在南京，梧桐树的应用十分广泛。以下是一些典型的应用场景：

1. 道路绿化：南京的许多主干道两侧都种植了梧桐树，为市民出行提供了舒适的遮阳避雨环境。

2. 公园绿化：南京的各大公园内，梧桐树成为一道亮丽的风景线。如南京紫金山、玄武湖等公园，都种植了大量的梧桐树。

3. 居住区绿化：在南京的住宅小区，梧桐树成为居民们休闲娱乐的好去处。树下乘凉、聊天，成为居民们生活中的一部分。

4. 文化景点绿化：南京的许多文化景点，如中山陵、明孝陵等，都种植了梧桐树，为这些景点增添了古朴、典雅的气息。

总之，南京市的梧桐树不仅是一道亮丽的风景线，更是这座城市历史的见证者和文化的传承者。在未来的发展中，南京将继续发挥梧桐树的优势，为市民创造更加美好的生活环境。
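
The generation cell above slices `outputs` by the prompt length before decoding, because for decoder-only models `generate` returns the prompt tokens followed by the newly generated ones. A small sketch of that slicing on plain lists (the token ids and the `strip_prompt` helper are made up for illustration):

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first prompt_len tokens from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]

prompt_ids = [101, 102, 103]          # hypothetical prompt token ids
generated = [prompt_ids + [7, 8, 9]]  # prompt followed by the completion
completion = strip_prompt(generated, len(prompt_ids))
print(completion)  # [[7, 8, 9]]
```

Without this step the decoded text would begin with the rendered chat template and the user's query.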


## 2 Loading with vLLM

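The Tesla P40 is not officially supported by vLLM: it is a Pascal GPU (compute capability 6.1), while the vLLM installation requirements list compute capability 7.0 or higher. On this card the cell below loads the weights and then fails with a CUDA kernel error.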

In [2]:
import os
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

# GLM-4-9B-Chat
# If you run out of memory (OOM), reduce max_model_len or increase tp_size
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
MODEL_PATH = '/opt/Data/ModelWeight/THUDM/glm-4-9b-chat'
max_model_len, tp_size = 131072, 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype="float16"
    # For GLM-4-9B-Chat-1M, if you run out of memory (OOM), enable the parameters below
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
# Token ids that end generation for the GLM-4 chat model
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
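
The OOM advice in the comments above comes down to KV-cache size, which grows linearly with `max_model_len`. A rough back-of-the-envelope sketch follows. The model shape used here (40 layers, 2 KV heads, head dimension 128) is an assumption and should be checked against the checkpoint's `config.json`; `kv_cache_bytes` is a hypothetical helper, not a vLLM API.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Approximate KV-cache size in bytes for one sequence of seq_len tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed GLM-4-9B-Chat shape; float16 uses 2 bytes per element.
full = kv_cache_bytes(40, 2, 128, 131072)
short = kv_cache_bytes(40, 2, 128, 8192)
print(full / 2**30, short / 2**30)  # 5.0 0.3125
```

Under these assumptions a single full-length sequence needs about 5 GiB of cache on top of the model weights, which is why lowering `max_model_len` or spreading the model over more GPUs with `tp_size` relieves memory pressure.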

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-19 22:15:57 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/opt/Data/ModelWeight/THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='/opt/Data/ModelWeight/THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/opt/Data/ModelWeight/THUDM/glm-4-9b-chat)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-19 22:15:58 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-19 22:15:58 selector.py:51] Using XFormers backend.
INFO 06-19 22:15:58 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-19 22:15:58 selector.py:51] Using XFormers backend.
INFO 06-19 22:16:02 model_runner.py:160] Loading model weights took 17.5635 GB


RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
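
The `no kernel image is available` error above usually means the installed binaries contain no kernels compiled for this GPU's architecture. A minimal sketch of a pre-flight check, assuming a minimum compute capability of 7.0 (taken from the vLLM installation requirements, so treat it as an assumption for your version); `is_supported` is a hypothetical helper:

```python
MIN_CAPABILITY = (7, 0)  # assumed minimum for the prebuilt vLLM binaries

def is_supported(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)

print(is_supported((6, 1)))  # Tesla P40 (Pascal): False
print(is_supported((7, 5)))  # Turing, e.g. T4: True

# On a machine with PyTorch and a visible GPU, the real value can be read with
# torch.cuda.get_device_capability(0) and passed to is_supported().
```

Running such a check before constructing `LLM(...)` gives a clearer failure than the asynchronous CUDA error.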


In [None]:
prompt = [{"role": "user", "content": "你好"}]
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
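
The `stop_token_ids` passed to `SamplingParams` earlier end generation at the first occurrence of any listed id. A simplified sketch of that cut-off on a list of token ids (`truncate_at_stop` and the sample ids are hypothetical; this is not vLLM's internal code):

```python
def truncate_at_stop(token_ids, stop_token_ids):
    """Return the tokens that precede the first stop token, if any."""
    stops = set(stop_token_ids)
    for pos, tok in enumerate(token_ids):
        if tok in stops:
            return token_ids[:pos]
    return token_ids

stop_token_ids = [151329, 151336, 151338]
print(truncate_at_stop([11, 22, 151336, 33], stop_token_ids))  # [11, 22]
```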