## Datasets for SFT

Reference material for the supervised fine-tuning (SFT) project:
*   https://github.com/microsoft/Phi-3CookBook


1. [opus samantha](https://huggingface.co/datasets/macadeliccc/opus_samantha) - philosophy, personality, relationships, etc.
1. [Guanaco ShareGPT style](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style)
1. [Guanaco OpenAssistant](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)
1. [UltraChat multi-turn](https://huggingface.co/datasets/stingning/ultrachat)



In [4]:
from datasets import load_dataset
from rich import print


from transformers import AutoTokenizer, AutoModelForCausalLM

In [5]:

from dotenv import dotenv_values

config = dotenv_values(".env")

hf_key = config["HUGGINGFACEHUB_API_TOKEN"]
print(hf_key[0:6])


from huggingface_hub import login
login(token=hf_key)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\krzys\.cache\huggingface\token
Login successful


## Load tokenizers


### Phi family


* https://huggingface.co/microsoft/Phi-3.5-mini-instruct (3.8B params) - 128k
* https://huggingface.co/microsoft/Phi-3-mini-128k-instruct


In [6]:
phi_35_model_id = "microsoft/Phi-3.5-mini-instruct"
phi_35_tokenizer = AutoTokenizer.from_pretrained(phi_35_model_id)

In [7]:
phi_3_model_id = "microsoft/Phi-3-mini-128k-instruct"

phi_3_tokenizer = AutoTokenizer.from_pretrained(phi_3_model_id, trust_remote_code=True)

In [30]:
phi_35_tokenizer.chat_template

"{% for message in messages %}{% if message['role'] == 'system' and message['content'] %}{{'<|system|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'user' %}{{'<|user|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>\n' + message['content'] + '<|end|>\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}"

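The Jinja template above is dense, so here is a minimal plain-Python sketch of the same logic. The helper name `render_phi_chat` is hypothetical, and the default `eos_token` value is an assumption; in practice it comes from the tokenizer.

```python
def render_phi_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Hypothetical helper mirroring the Phi-3.5 chat template printed above."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system" and content:
            # System turns are emitted only when their content is non-empty.
            parts.append("<|system|>\n" + content + "<|end|>\n")
        elif role == "user":
            parts.append("<|user|>\n" + content + "<|end|>\n")
        elif role == "assistant":
            parts.append("<|assistant|>\n" + content + "<|end|>\n")
    # Finish with an open assistant turn (inference) or the EOS token (training text).
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


# Made-up two-turn conversation, used only for illustration.
sample = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
rendered = render_phi_chat(sample)
print(rendered)
```

With `add_generation_prompt=False`, the rendered text ends with the EOS token instead of an open `<|assistant|>` turn.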
### Llama 3 family

In [33]:

llama_31_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # base model: "meta-llama/Meta-Llama-3.1-8B"

llama_31_tokenizer = AutoTokenizer.from_pretrained(llama_31_model_id, trust_remote_code=True)

llama_31_tokenizer.chat_template

'{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = "26 Jul 2024" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %}\n    {%- set system_message = messages[0][\'content\']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- "<|start_header_id|>system<|end_header_id|>\\n\\n" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- "Environment: ipython\\n" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- "Tools: " + builtin_tools | reject(\'equalto\', \'code_interp

In [34]:
llama_32_model_id = "meta-llama/Llama-3.2-1B-Instruct"
llama_32_tokenizer = AutoTokenizer.from_pretrained(llama_32_model_id, trust_remote_code=True)

llama_32_tokenizer.chat_template

'{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- if strftime_now is defined %}\n        {%- set date_string = strftime_now("%d %b %Y") %}\n    {%- else %}\n        {%- set date_string = "26 Jul 2024" %}\n    {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0][\'role\'] == \'system\' %}\n    {%- set system_message = messages[0][\'content\']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = "" %}\n{%- endif %}\n\n{#- System message #}\n{{- "<|start_header_id|>system<|end_header_id|>\\n\\n" }}\n{%- if tools is not none %}\n    {{- "Environment: ipython\\n" }}\n{%- endif %}\n{{- "Cutting
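The printed templates are truncated, so their full logic is not visible here. As rough orientation only, the sketch below shows the per-turn layout used by Llama 3 instruct models. The header tokens appear in the excerpt above; the `<|eot_id|>` terminator and the `<|begin_of_text|>` BOS value are not visible in the excerpt and are stated here as assumptions based on the published Llama 3 prompt format. The helper names are hypothetical, and the real templates also emit a system preamble (date, tools) that this sketch omits.

```python
def render_llama3_turn(role, content):
    """Hypothetical helper: one chat turn in the Llama 3 header layout."""
    # Each turn is a role header, a blank line, the trimmed content, then an end-of-turn token.
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content.strip()}<|eot_id|>"


def render_llama3_chat(messages, bos_token="<|begin_of_text|>"):
    """Hypothetical helper: concatenate turns after the BOS token (no system preamble)."""
    return bos_token + "".join(
        render_llama3_turn(m["role"], m["content"]) for m in messages
    )


# Made-up conversation, used only for illustration.
llama_sample = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
llama_rendered = render_llama3_chat(llama_sample)
print(llama_rendered)
```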

In [10]:
hermes_model_id = "NousResearch/Hermes-2-Theta-Llama-3-8B"

hermes_tokenizer = AutoTokenizer.from_pretrained(hermes_model_id, trust_remote_code=True)


## Load Datasets


### Samantha dataset

In [17]:
samanta_dataset = load_dataset("macadeliccc/opus_samantha", split="train")

In [18]:
print(samanta_dataset)

In [19]:
print(samanta_dataset[5])

In [20]:
def map_samanta2chat_batch(dataset_batch):
    """
    Convert a batch of conversations into chat-format message lists.

    With `batched=True`, `dataset_batch["conversations"]` is a list of conversations,
    each a list of turns with 'from' and 'value' keys.
    """
    conversations = dataset_batch["conversations"]
    chats = []

    # Map the dataset's 'from' values to the roles used in the 'messages' list.
    mapper = {"system": "system", "human": "user", "gpt": "assistant"}

    for conv in conversations:
        messages = []
        for c in conv:
            # Build a message with 'role' and 'content' keys from the turn's 'from' and 'value' keys.
            msg = {
                "role": mapper[c["from"]],
                "content": c["value"],
            }
            messages.append(msg)
        chats.append(messages)

    # Return the batch as a dictionary with one list of messages per conversation.
    return {"messages": chats}


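def map_samanta2chat_row(dataset_row):
    """
    Convert a single conversation (one dataset row) into a chat-format message list.
    """
    conversation = dataset_row["conversations"]
    messages = []
    # Map the dataset's 'from' values to the roles used in the 'messages' list.
    mapper = {"system": "system", "human": "user", "gpt": "assistant"}

    for c in conversation:
        # create a message dictionary with 'role' and 'content' keys, based on the 'from' and 'value' keys in the conversation.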
        msg = {
            "role": mapper[c["from"]],
            "content": c["value"],
        }
        # add the message to the 'messages' list.
        messages.append(msg)

    # return one conversation as a dictionary with a list of messages.
    return {"messages": messages}
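The two functions differ only in the shape of their input. With `batched=True`, `Dataset.map` passes a dictionary whose values are lists with one entry per row; with `batched=False`, it passes a single row. The sketch below illustrates this on a hand-made sample without downloading anything. The sample turns and the names `ROLE_MAP` and `to_messages` are made up for illustration.

```python
# Same 'from' -> 'role' mapping as in the functions above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(conversation):
    """Hypothetical helper: convert one list of turns into chat messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conversation]


# One row: 'conversations' holds a single conversation (a list of turns).
row = {
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there"},
    ]
}

# One batch: 'conversations' holds a list of conversations, one per row.
batch = {"conversations": [row["conversations"], row["conversations"]]}

row_result = {"messages": to_messages(row["conversations"])}
batch_result = {"messages": [to_messages(c) for c in batch["conversations"]]}

print(row_result)
print(batch_result["messages"][0] == row_result["messages"])
```

Both paths yield the same messages per conversation, so the choice between them is mainly about throughput rather than output.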


In [21]:
# apply the batched mapping function to the dataset (2 rows per call)
samanta_chat_ds = samanta_dataset.map(map_samanta2chat_batch, batched=True, batch_size=2, remove_columns="conversations")

In [22]:
print(samanta_chat_ds[0:1])


In [23]:
# apply the row-wise mapping function to the dataset (one row per call)
samanta_chat_rows_ds = samanta_dataset.map(map_samanta2chat_row, batched=False, remove_columns="conversations")

In [24]:
print(samanta_chat_rows_ds[0:1])

### Chat templates for different models

In [35]:
phi_chat_msg = phi_3_tokenizer.apply_chat_template(
    samanta_chat_ds[0:1]['messages'], add_generation_prompt=False, tokenize=False
)
print(phi_chat_msg)

In [36]:
phi_35_message = phi_35_tokenizer.apply_chat_template(samanta_chat_ds[0:1]['messages'], add_generation_prompt=False, tokenize=False)

print(phi_35_message)

In [37]:
llama31_msg = llama_31_tokenizer.apply_chat_template(samanta_chat_ds[0:1]['messages'], add_generation_prompt=False, tokenize=False)
print(llama31_msg)

In [41]:
llama32_msg = llama_32_tokenizer.apply_chat_template(samanta_chat_ds[0:1]['messages'], add_generation_prompt=False, tokenize=False)
print(llama32_msg)

In [39]:
hermes_msg = hermes_tokenizer.apply_chat_template(samanta_chat_ds[0]['messages'], add_generation_prompt=False, tokenize=False)
print(hermes_msg)