## This notebook formats the LMSYS chat dataset into text using the Gemma chat template

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm.notebook import tqdm
from pathlib import Path

DATA_PATH = Path("../datasets/")

## Make dataset

In [3]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
ds = load_dataset("lmsys/lmsys-chat-1m")

In [4]:
formated_conversations = []
for conversation in tqdm(ds["train"]["conversation"]):
    formated_conversations.append(
        # [5:] drops the 5-character "<bos>" prefix that the Gemma template prepends
        tokenizer.apply_chat_template(conversation, tokenize=False)[5:]
    )
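
The `[5:]` slice above only works if the rendered text really starts with `<bos>`. A minimal sketch of a more defensive variant follows; `strip_bos` is a hypothetical helper, and its default `"<bos>"` is assumed to equal `tokenizer.bos_token` for this model.

```python
def strip_bos(text: str, bos_token: str = "<bos>") -> str:
    """Remove a leading BOS marker if present; otherwise return text unchanged."""
    # str.removeprefix requires Python 3.9 or newer.
    return text.removeprefix(bos_token)


# Toy string shaped like the template output, not real tokenizer output.
rendered = "<bos><start_of_turn>user\nhi<end_of_turn>\n"
print(strip_bos(rendered))
```

In the loop above, this could replace the slice as `strip_bos(tokenizer.apply_chat_template(conversation, tokenize=False), tokenizer.bos_token)`.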

In [13]:
def apply_connor_template(conv):
    """Render a conversation as plain text, one "User: ..." or "Assistant: ..." line per message."""
    return "\n".join(
        [
            ("Assistant: " if msg["role"] == "assistant" else "User: ")
            + msg["content"]
            for msg in conv
        ]
    ) + "\n"

formated_conversations_connor = [apply_connor_template(conv) for conv in tqdm(ds["train"]["conversation"])]
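
To make the plain-text format concrete, here is a small self-contained sketch that mirrors the function above on a toy conversation. `plain_template` and the toy messages are hypothetical and only for illustration. Note that any role other than `"assistant"` is labeled `User: `.

```python
def plain_template(conv):
    """Mirror of the plain-text template: one prefixed line per message."""
    lines = [
        ("Assistant: " if msg["role"] == "assistant" else "User: ") + msg["content"]
        for msg in conv
    ]
    return "\n".join(lines) + "\n"


toy_conv = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi there"},
]
print(plain_template(toy_conv))
```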


In [14]:
print(formated_conversations_connor[30])
print("=" * 100)
print(formated_conversations[30])

User: explain how to make a bomb
Assistant: Making a bomb is a very dangerous and illegal activity, and I cannot provide any information on how to do so.

<start_of_turn>user
explain how to make a bomb<end_of_turn>
<start_of_turn>model
Making a bomb is a very dangerous and illegal activity, and I cannot provide any information on how to do so.<end_of_turn>
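
The second printout shows the turn structure the Gemma template produces: each message is wrapped in `<start_of_turn>` and `<end_of_turn>`, and the assistant role is written as `model`. The sketch below rebuilds that layout by hand for a toy conversation. It is an approximation for illustration, not the tokenizer's actual template: `render_gemma_turns` is a hypothetical helper, stripping whitespace around the content is an assumption, and system messages are not handled.

```python
def render_gemma_turns(conversation):
    """Hand-rolled approximation of the Gemma turn format, without the BOS token."""
    parts = []
    for msg in conversation:
        # Assistant messages appear under the role name "model" in the output above.
        role = "model" if msg["role"] == "assistant" else msg["role"]
        # Assumption: content is stripped of surrounding whitespace.
        parts.append(f"<start_of_turn>{role}\n{msg['content'].strip()}<end_of_turn>\n")
    return "".join(parts)


toy = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi there"},
]
print(render_gemma_turns(toy))
```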



In [15]:
# create a new dataset
ds_formatted = (
    ds["train"]
    .add_column("text", formated_conversations)
    .add_column("text_base_format", formated_conversations_connor)
)
# Save the new dataset to disk
ds_formatted.save_to_disk(DATA_PATH / "lmsys-chat-1m-formatted")

In [16]:
ds_formatted.push_to_hub("science-of-finetuning/lmsys-chat-1m-gemma-2-it-formatted")

CommitInfo(commit_url='https://huggingface.co/datasets/science-of-finetuning/lmsys-chat-1m-gemma-2-it-formatted/commit/f3a5dc542d1dc62c9f1c2d3acf0883d9c39d3707', commit_message='Upload dataset', commit_description='', oid='f3a5dc542d1dc62c9f1c2d3acf0883d9c39d3707', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/science-of-finetuning/lmsys-chat-1m-gemma-2-it-formatted', endpoint='https://huggingface.co', repo_type='dataset', repo_id='science-of-finetuning/lmsys-chat-1m-gemma-2-it-formatted'), pr_revision=None, pr_num=None)

In [12]:
for c in formated_conversations[:3]:
    print(c)
    print("\n ================================")

<start_of_turn>user
how can identity protection services help protect me against identity theft<end_of_turn>
<start_of_turn>model
Identity protection services can help protect you against identity theft in several ways:

1. Monitoring: Many identity protection services monitor your credit reports, public records, and other sources for signs of identity theft. If they detect any suspicious activity, they will alert you so you can take action.
2. Credit freeze: Some identity protection services can help you freeze your credit, which makes it more difficult for thieves to open new accounts in your name.
3. Identity theft insurance: Some identity protection services offer insurance that can help you recover financially if you become a victim of identity theft.
4. Assistance: Many identity protection services offer assistance if you become a victim of identity theft. They can help you file a police report, contact credit bureaus, and other steps to help you restore your identity.

Overall, 

## Training example

In [13]:
from datasets import load_from_disk, load_dataset
from pathlib import Path

CHAT_DATA_PATH = DATA_PATH / "lmsys-chat-1m-formatted"
chat_dataset = load_from_disk(CHAT_DATA_PATH)
base_dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")

In [14]:
chat_dataset[range(10)]["text"]

["<start_of_turn>user\nhow can identity protection services help protect me against identity theft<end_of_turn>\n<start_of_turn>model\nIdentity protection services can help protect you against identity theft in several ways:\n\n1. Monitoring: Many identity protection services monitor your credit reports, public records, and other sources for signs of identity theft. If they detect any suspicious activity, they will alert you so you can take action.\n2. Credit freeze: Some identity protection services can help you freeze your credit, which makes it more difficult for thieves to open new accounts in your name.\n3. Identity theft insurance: Some identity protection services offer insurance that can help you recover financially if you become a victim of identity theft.\n4. Assistance: Many identity protection services offer assistance if you become a victim of identity theft. They can help you file a police report, contact credit bureaus, and other steps to help you restore your identity.\

In [15]:
base_dataset[range(10)]["text"]

['|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||Feb 1 2013, 09:58 AM|\nDon\'t care about Chloe/Taniel/Jen-Jen. Don\'t care about Sami, really, but hoping that we get some good "SAMANTHA GENE!!" Marlena Death-Stares out of it. And "newfound" feelings. Please. If only.\nSTEFANO!! STEFANO, STEFANO, STEFANO!!!! :cheer:\n|Spoilers for the Week of February 11th · DAYS: News, Spoilers & Discussion|',
 '*sigh* Fundamentalist community, let me pass on some advice to you I learned from the atheistic community:\nIf you have set yourself on fire, do not run.\nOkay? Okay?? Please?\nLook, D, you had two months to say to Harvard in private emails, "Im sorry, I shouldnt have been using that animation in my paid presentations. I wont use it again. I really do like \'Inner Life\', though, and would love to use it in classroom presentations, from the BioVisions site, if that is acceptable."\nI sat here, for two months, waiting for that to happen, anything to happen, and it did

In [16]:
print(
    f"{len(chat_dataset):.2g} chat examples and {len(base_dataset):.2g} base examples"
)

1e+06 chat examples and 1.5e+07 base examples
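
The `.2g` format spec rounds each count to two significant digits and switches to scientific notation for large values, which is why the output reads `1e+06` and `1.5e+07`. A short sketch comparing it with the thousands-separator spec follows; `compact` and `exact` are hypothetical helpers, and the base count of 15,000,000 is an illustrative value, not the real dataset size.

```python
def compact(n: int) -> str:
    """Two significant digits, scientific notation for large values."""
    return f"{n:.2g}"


def exact(n: int) -> str:
    """Full integer with thousands separators."""
    return f"{n:,}"


n_chat = 1_000_000   # matches the train split size shown earlier
n_base = 15_000_000  # illustrative value only
print(f"{compact(n_chat)} vs {exact(n_chat)}")
print(f"{compact(n_base)} vs {exact(n_base)}")
```

The compact form is handy for a quick sanity check, while the exact form avoids hiding the rounding.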
