# Hugging Face

The weights downloaded from Meta are in bfloat16 and must be converted to fp16 before the Hugging Face framework can use them.
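To illustrate how the two 16-bit formats differ, here is a minimal sketch using only the Python standard library. It is not the conversion that the Hugging Face tooling performs; the helper names (`to_bfloat16`, `to_float16`, `fits_in_float16`) are hypothetical, and bfloat16 is approximated by truncating a float32 rather than by round-to-nearest.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Truncate x to bfloat16 precision (keep the top 16 bits of its float32 encoding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_float16(x: float) -> float:
    """Round x to IEEE 754 half precision using struct's 'e' format."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


def fits_in_float16(x: float) -> bool:
    """Return False when x is too large in magnitude for a finite float16."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    # bfloat16 keeps float32's exponent range but fewer mantissa bits;
    # float16 has more mantissa bits but a much smaller range.
    for value in (0.1, 70000.0):
        print(value, to_bfloat16(value), fits_in_float16(value))
```

bfloat16 trades precision for range, while float16 does the opposite, so a value that fits in bfloat16 can overflow when cast to float16.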

In [1]:
import platform

if platform.system() == "Windows":
    MODEL_PATH = "E:/THUDM/llama2.hf/llama-2-7b-chat"
else:
    MODEL_PATH = "/opt/Data/THUDM/llama2.hf/llama-2-7b-chat-hf"

# 1 LlamaForCausalLM and LlamaTokenizer

## 1.1 CPU inference

In [2]:
# https://huggingface.co/docs/transformers/main_classes/model
# torch_dtype
# torch.float16 or torch.bfloat16 or torch.float: load in a specified dtype, ignoring the model’s config.torch_dtype if one exists. If not specified
# the model will get loaded in torch.float (fp32).

import torch

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
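Because no `torch_dtype` is passed in the cell above, the quoted documentation implies the weights are loaded as fp32. A rough back-of-the-envelope estimate of what that means for memory is sketched below. The parameter count of 7 billion is a nominal figure taken from the model name, and the helper `weight_memory_gib` is hypothetical; the estimate covers weights only and ignores activations and the KV cache.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    nominal_params = 7_000_000_000  # assumption: "7b" taken at face value
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

Under these assumptions, loading in fp16 roughly halves the weight footprint compared with the fp32 default.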

In [3]:
prompt = "Hey, are you conscious? Can you talk to me?"

inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=4096, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

"Hey, are you conscious? Can you talk to me?\n\nI'm just an AI, I don't have consciousness or the ability to talk like a human. I'm here to help answer your questions and provide information to the best of my ability. Is there something specific you would like to know or discuss?"

## 1.2 GPU inference

In [3]:
import torch

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to('cuda')

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
prompt = "Hey, are you conscious? Can you talk to me?"

# Generate
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids.cuda(), max_length=4096, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

"Hey, are you conscious? Can you talk to me?\n\nI'm just an AI, I don't have consciousness or the ability to talk like a human. I'm here to help answer your questions and provide information to the best of my ability. Is there something specific you would like to know or discuss?"

## 1.3 TextStreamer

In [None]:
import torch
from transformers import AutoTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True, add_prefix_space=True)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to('cuda')
model = model.eval()
# Extra keyword arguments are forwarded to tokenizer.decode, so only decode options belong here.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

In [None]:
instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

prompt = instruction.format("what's the result of 10*4?")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id, streamer=streamer)
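The prompt layout used above (`[INST] <<SYS>> ... <</SYS>> ... [/INST]`) can also be assembled by a small helper, which keeps the system prompt and the user message separate. This is only a sketch that mirrors the string in the cell above; the function name `build_llama2_prompt` is hypothetical.

```python
def build_llama2_prompt(user_message: str, system_prompt: str) -> str:
    """Assemble a single-turn prompt in the same layout as the template above."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"


if __name__ == "__main__":
    print(build_llama2_prompt("what's the result of 10*4?", "You are a helpful assistant."))
```

Unlike `str.format` on a template string, this version needs no escaping when the system prompt itself contains literal braces.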

In [None]:
# https://github.com/LinkSoul-AI/Chinese-Llama-2-7b

instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

# Build the prompt first, then tokenize it; otherwise the previous cell's prompt is reused.
prompt = instruction.format("When is the best time to visit Beijing, and do you have any suggestions for me?")
inputs = tokenizer(prompt, return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generate_ids = model.generate(inputs.input_ids.cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id, streamer=streamer)

# 2 transformers.pipeline

In [2]:
import transformers
import torch

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
pipeline = transformers.pipeline(
    "text-generation",
    model=MODEL_PATH,
    torch_dtype=torch.float16,
    # Alternatives: device_map="auto" or device_map="cpu"
    device_map="cuda",
)

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
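The generation call that follows passes `do_sample=True` and `top_k=10`. As a rough illustration of what top-k sampling means, the sketch below keeps the `k` highest-scoring candidates and samples among them in proportion to their softmax weights. It is a pure-Python toy, not the transformers implementation; the function name `top_k_sample` and the example logits are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample an index from the k highest logits, weighted by softmax over those k."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 0.5, 1.5, -1.0]  # hypothetical scores for a 4-token vocabulary
    print([top_k_sample(toy_logits, k=2, rng=rng) for _ in range(10)])
```

With `k=1` this reduces to greedy decoding; larger values of `k` allow more varied continuations.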

In [3]:
sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    max_length=4096,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?

- TV Fan

Hi TV Fan! Yes, I can definitely recommend some other shows that you might enjoy based on your preferences. Here are a few suggestions:

1. "The Sopranos" - This classic HBO series is a crime drama that follows the life of a New Jersey mob boss, Tony Soprano, as he navigates the criminal underworld and deals with personal and family problems.
2. "The Wire" - This HBO series explores the drug trade in Baltimore from multiple perspectives, including law enforcement, drug dealers, and politicians. It's a gritty and intense show that's sure to keep you on the edge of your seat.
3. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA agents who hunted him down. It's a thrilling and action-packed show that's full of twists and turns.
4. "Mad Men" - This AMC series is set in the 1960s and follows the lives