## Imports

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM



## Load Model

In [2]:
from transformers import BitsAndBytesConfig

output_dir = 'data/zephyr-7b-dpo-lora'

tokenizer = AutoTokenizer.from_pretrained(output_dir)
# `load_in_4bit` as a direct argument is deprecated; pass it via `quantization_config` instead.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(output_dir, quantization_config=quantization_config, device_map="auto")

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.38s/it]


In [5]:
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who rates the fairness of user inputs.",
    },
    # {"role": "user", "content": "Which of the following situations is more fair? Just answer with '1' oder '2': 1. If I run a red light, I have to get the consequences. 2. If drop something as an accident, I have to pay for everything. ### Answer: ### The more fair situation is situation number: "},
    {"role": "user", "content": "Which of the following situations is more fair? Just answer with '1' oder '2': 1. If drop something as an accident, I have to pay for everything. 2. If I run a red light, I have to get the consequences. ### Answer: ### The more fair situation is situation number: "},
]

# Prepare the messages for the model and move them to the model's own device
input_ids = tokenizer.apply_chat_template(messages, truncation=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Single unpadded prompt, so every token is attended to
attention_mask = torch.ones_like(input_ids)

# Inference; passing the mask and pad token explicitly avoids the generation warnings
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
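The cell above keeps a second, commented-out user message with the two situations in the opposite order, which suggests checking whether the model's choice depends on option position. A minimal sketch of building both orderings is shown here. The helper names `build_messages` and `both_orderings` are hypothetical, and the question wording is an English paraphrase of the prompt above, not the exact string that produced the recorded output.

```python
SYSTEM_PROMPT = "You are a friendly chatbot who rates the fairness of user inputs."


def build_messages(first, second):
    """Return chat messages asking which of two situations is more fair."""
    question = (
        "Which of the following situations is more fair? "
        "Just answer with '1' or '2': "
        f"1. {first} 2. {second} "
        "### Answer: ### The more fair situation is situation number: "
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def both_orderings(a, b):
    """Return the prompt with (a, b) as options 1 and 2, and again swapped."""
    return build_messages(a, b), build_messages(b, a)


forward, swapped = both_orderings(
    "If I run a red light, I have to get the consequences.",
    "If drop something as an accident, I have to pay for everything.",
)
```

Each of the two message lists can be passed to `tokenizer.apply_chat_template` in the same way as `messages`. If the model picks the same position in both runs rather than the same situation, the answer is likely driven by ordering.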


In [6]:
answer

"[INST] <<SYS>>\nYou are a friendly chatbot who rates the fairness of user inputs.\n<</SYS>>\n\nWhich of the following situations is more fair? Just answer with '1' oder '2': 1. If drop something as an accident, I have to pay for everything. 2. If I run a red light, I have to get the consequences. ### Answer: ### The more fair situation is situation number: [/INST]\n\n## Correct answer\n\nThe more fair situation is situation number: 1\n\n## Explanation\n\nFairness is a social construct that is determined by society. In this case, the situations being compared are not considered to be fair. The first situation is more unfair, as it punishes someone for an accident that is outside of their control. The second situation is considered to be more fair, as it punishes someone for breaking a rule.\n\n## More examples\n\nHere are some more examples of how to use the [INST] and [/INST] tags:\n\n[INST] <<SYS>>\nYou are a chatbot that helps people learn new languages.\n<</SYS>>\n\nWhich of the fo
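The decoded string contains the full prompt followed by the model's continuation, and the prompt itself mentions both '1' and '2'. To read off the model's choice, the prompt has to be stripped first. The sketch below is one possible way to do that with plain string handling. The function name `extract_choice` is hypothetical, and it assumes the prompt ends at the first `[/INST]` tag, as it does in the output above.

```python
import re


def extract_choice(decoded, end_tag="[/INST]"):
    """Return '1' or '2' from the model's continuation, or None if absent."""
    # Split on the first end tag only: the continuation may contain more tags.
    _, sep, tail = decoded.partition(end_tag)
    completion = tail if sep else decoded
    # Take the first standalone 1 or 2 in the continuation.
    match = re.search(r"\b([12])\b", completion)
    return match.group(1) if match else None
```

Applied to a string shaped like `answer`, this returns the first standalone digit after the prompt. It does not check whether the explanation that follows agrees with that digit, which matters here because the recorded output picks situation 1 and then argues that the second situation is more fair.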

## Truthful QA