In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch



In [2]:
output = "../out/"

tokenizer = AutoTokenizer.from_pretrained(output)
model = AutoModelForCausalLM.from_pretrained(output, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map='cuda')
model.eval()
tokens = tokenizer.encode("During photosynthesis in green plants", return_tensors="pt")
tokens = tokens.to('cuda')
generated = model.generate(tokens, max_new_tokens=64, repetition_penalty=1.3)  # separate name so the `output` path is not overwritten
samples = tokenizer.batch_decode(generated)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
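
The warnings above ask for an explicit `attention_mask` and `pad_token_id`. A minimal sketch of one way to supply both, assuming the same `model` and `tokenizer` objects as in the cell above; the helper name `generate_with_mask` is hypothetical and not part of Transformers:

```python
def generate_with_mask(model, tokenizer, prompt, max_new_tokens=64):
    """Generate from `prompt`, passing the attention mask and pad token explicitly."""
    # Calling the tokenizer (instead of .encode) returns input_ids and attention_mask together.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated = model.generate(
        **inputs,  # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning applies implicitly
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Usage would be `samples = generate_with_mask(model, tokenizer, "During photosynthesis in green plants")`.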


In [29]:
tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

In [4]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.


You are a friendly chatbot who always responds in the style of a pirate<|endoftext|>How many helicopters can a human eat in one sitting?<|endoftext|>


In [3]:
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"


In [24]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", add_special_tokens=True)
print(tokenizer.decode(tokenized_chat[0]))

<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
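
To make the Jinja template easier to read, here is a plain-Python sketch of the same ChatML layout. The function name `render_chatml` is hypothetical; it only mirrors the template string set in the cell above and is not how Transformers renders templates internally.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(render_chatml(example, add_generation_prompt=True))
```

Comparing this string with `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to confirm the template does what is intended.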



## 3.5 Sonnet:

Based on your situation, it sounds like you've trained a language model from scratch and now want to adapt it for instruction following using a chat format. This is a common progression, and using chat templates can be very helpful for this task. Let's break down the key points and steps you should consider:

1. Adding special tokens:
   Your idea to add special tokens like <|im_start|> is a good one. These tokens can help delineate different parts of the conversation and roles. Here's how you might add new special tokens:

   ```python
   new_tokens = ["<|im_start|>", "<|im_end|>", "<|user|>", "<|assistant|>"]
   num_added_tokens = tokenizer.add_tokens(new_tokens, special_tokens=True)
   
   # Resize the token embeddings as we've added new tokens
   model.resize_token_embeddings(len(tokenizer))
   ```

2. Creating a chat template:
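   After adding the special tokens, you'll want to create a chat template. Chat templates are Jinja templates that loop over `messages`, so plain `{role}` / `{content}` placeholders would be emitted literally. Here's an example of what it might look like:

   ```python
   chat_template = (
       "{% for message in messages %}"
       "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
       "{% endfor %}"
       "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
   )
   
   tokenizer.chat_template = chat_template
   ```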

3. Using the chat template:
   Once you've set up the chat template, you can use it like this:

   ```python
   messages = [
       {"role": "user", "content": "Hello, how are you?"},
       {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?"},
       {"role": "user", "content": "Can you explain what chat templates are?"}
   ]
   
   model_input = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
   ```

4. Training considerations:
   When fine-tuning on instruction data, you'll want to format your training data using the chat template. You can do this as a preprocessing step:

   ```python
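   def format_instruction(example):
       # map() expects a dict; the key becomes a new column in the dataset
       formatted = tokenizer.apply_chat_template(example['messages'], tokenize=False, add_generation_prompt=False)
       return {"formatted_instruction": formatted}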
   
   dataset = dataset.map(format_instruction)
   ```

   Then, when you tokenize for training, make sure to set `add_special_tokens=False`:

   ```python
   tokenized_dataset = dataset.map(lambda x: tokenizer(x['formatted_instruction'], add_special_tokens=False))
   ```

5. Saving the changes:
   After making these changes, don't forget to save your tokenizer:

   ```python
   tokenizer.save_pretrained("path/to/save")
   ```

Remember, when you're using `apply_chat_template` with `tokenize=False` and then tokenizing separately, you should set `add_special_tokens=False` in the tokenization step to avoid adding extra special tokens that might interfere with your chat template.

These steps should help you transition your model to an instruction-following chat format. The chat templating system in Transformers is flexible, so you can adjust the template to fit your specific needs.

Would you like me to elaborate on any part of this process?

In [4]:
tokenizer.add_tokens(['<|im_start|>', '<|im_end|>'], special_tokens=True)
tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 1024)

In [5]:
output_tk = "../out_tk/"
tokenizer.save_pretrained(output_tk)
model.save_pretrained(output_tk)

In [6]:
# check if the new tokens are added

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", add_special_tokens=True)
print(tokenizer.decode(tokenized_chat[0]))
print(tokenized_chat)

NameError: name 'messages' is not defined
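
The cell above failed only because `messages` was no longer defined after the kernel restart. A check that does not depend on `messages` could look like the sketch below; `check_added_tokens` is a hypothetical helper, and it assumes `tokenizer` and `model` are the objects that were just resized and saved.

```python
def check_added_tokens(tokenizer, model, tokens=("<|im_start|>", "<|im_end|>")):
    """Return {token: id}; raise if a token is split or the embedding matrix is too small."""
    ids = {}
    for tok in tokens:
        encoded = tokenizer.encode(tok, add_special_tokens=False)
        # A properly registered special token encodes to exactly one id.
        if len(encoded) != 1:
            raise ValueError(f"{tok!r} was split into {len(encoded)} ids: {encoded}")
        ids[tok] = encoded[0]
    n_rows = model.get_input_embeddings().num_embeddings
    if n_rows < len(tokenizer):
        raise ValueError(f"embedding has {n_rows} rows but tokenizer has {len(tokenizer)} tokens")
    return ids
```

Calling `check_added_tokens(tokenizer, model)` should return one id per token if the vocabulary and the embedding matrix agree.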

In [2]:
output = "../out_tk/"

tokenizer = AutoTokenizer.from_pretrained(output)
model = AutoModelForCausalLM.from_pretrained(output, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map='cuda')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
from datasets import load_dataset

In [5]:
ds = load_dataset("teknium/OpenHermes-2.5")
ds["train"]

Dataset({
    features: ['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'],
    num_rows: 1001551
})

In [9]:
# show the first few examples from the conversations feature in the dataset
ds["train"][:5]["conversations"]

[[{'from': 'human',
   'value': 'Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic.',
   'weight': None},
  {'from': 'gpt',
   'value': "Here's the logic behind this:\n\n1. We know that February has 28 days in a non-leap year.\n2. If the tree drops 7 leaves every day, then over the course of February, it would drop:\n   Leaves dropped in February = Leaves per day * Days in February\n   = 7 leaves * 28 days\n   = 196 leaves\n\nSo, the tree would drop 196 leaves in February in a non-leap year.",
   'weight': None}],
 [{'from': 'human',
   'value': 'In analytical chemistry, what is the principle behind the use of an internal standard in quantitative analysis?\nA. It compensates for variations in sample preparation and instrumental response.\nB. It enhances the sensitivity of the analytical method.\nC. It reduces the detection limit of the analytical method.\nD. It increases the resolution between analyte peaks in 

In [6]:
def sharegpt_to_chatml(example):
    chatml_conversations = []
    for conv in example["conversations"]:
        if conv["from"] == "human":
            role = "user"
        elif conv["from"] == "system":
            role = "system"
        elif conv["from"] == "gpt":
            role = "assistant"
        else:
            role = "user"
        chatml_format = {"role": role, "content": conv["value"]}
        chatml_conversations.append(chatml_format)
    return {"conversations": chatml_conversations}

def format_conversation(example):
    formatted = tokenizer.apply_chat_template(example['conversations'], tokenize=False, add_generation_prompt=False)
    return {"formatted_conversation": formatted}

def tokenize_conversation(example):
    return tokenizer(example['formatted_conversation'], truncation=True, max_length=1024, add_special_tokens=False)

ds = ds.map(sharegpt_to_chatml)
ds_format = ds.map(format_conversation)
ds_tokenized = ds_format.map(tokenize_conversation, remove_columns=['conversations', 'formatted_conversation'])
ds_tokenized.set_format("torch")

Map: 100%|██████████| 1001551/1001551 [02:18<00:00, 7226.14 examples/s]
Map: 100%|██████████| 1001551/1001551 [12:17<00:00, 1358.52 examples/s]
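
As a quick sanity check of the role mapping, the conversion can be tried on a single hand-written conversation before mapping the full dataset. This is a sketch only: `ROLE_MAP` and `to_chatml` are hypothetical names that mirror the if/elif chain in `sharegpt_to_chatml`, and the sample text is made up for illustration.

```python
# ShareGPT sender -> ChatML role; unknown senders fall back to "user", as in the cell above.
ROLE_MAP = {"human": "user", "system": "system", "gpt": "assistant"}


def to_chatml(conversation):
    """Convert one ShareGPT-style turn list into role/content dicts."""
    return [
        {"role": ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
        for turn in conversation
    ]


sample = [
    {"from": "system", "value": "Answer briefly.", "weight": None},
    {"from": "human", "value": "How many days does February have in a non-leap year?", "weight": None},
    {"from": "gpt", "value": "28 days.", "weight": None},
]
for turn in to_chatml(sample):
    print(turn["role"], "->", turn["content"])
```

The extra `weight` key is dropped, which matches what the mapped `conversations` column contains after `sharegpt_to_chatml`.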


In [9]:
ds_tokenized.save_to_disk("../data")

Saving the dataset (5/5 shards): 100%|██████████| 1001551/1001551 [00:02<00:00, 403372.09 examples/s]


In [4]:
from datasets import load_from_disk
ds_tokenized = load_from_disk("../data")  # load_from_disk is the counterpart of save_to_disk

In [8]:
len(ds_tokenized["train"][9]["input_ids"])

1024
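
A length of exactly 1024 means this example reached `max_length` and was probably truncated, so its final `<|im_end|>` may be missing. A small sketch for estimating how often that happens; `truncation_stats` is a hypothetical helper and assumes each example has an `input_ids` field.

```python
def truncation_stats(examples, max_length=1024):
    """Return (number of examples at the length limit, number of examples inspected)."""
    lengths = [len(ex["input_ids"]) for ex in examples]
    at_limit = sum(1 for n in lengths if n >= max_length)
    return at_limit, len(lengths)


# Toy input; on the real data one might pass ds_tokenized["train"].select(range(1000)).
toy = [{"input_ids": [0] * 1024}, {"input_ids": [0] * 300}, {"input_ids": [0] * 1024}]
at_limit, total = truncation_stats(toy)
print(f"{at_limit}/{total} examples reach max_length")
```

If the share is large, a higher `max_length` or dropping over-long conversations may be worth considering.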

In [11]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=ds_tokenized["train"],
)

trainer.train()

Step,Training Loss


KeyboardInterrupt: 
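
For a sense of how long this run would be, the number of optimizer steps follows from the dataset size and the arguments above. The sketch assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; `training_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def training_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(1_001_551, 8, 3)
print(steps_per_epoch, total_steps)  # 125194 375582
```

With `save_steps=10_000` that is a checkpoint roughly every 8% of an epoch, and `save_total_limit=2` keeps only the two most recent ones.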