## Mistral-7B SFT on HuggingFaceH4/no_robots

### Load data + Preprocessing

In [3]:
from datasets import load_dataset, DatasetDict

In [4]:
raw_dataset = load_dataset("HuggingFaceH4/no_robots")
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 9500
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 500
    })
})

In [6]:
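# Keep a small subset for a quick run: 100 rows from each split.
# range() is half-open, so these cover train rows 0-99 and test rows 101-200.
indice_1 = range(0,100)
indice_2 = range(101, 201)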

dataset_dict = {
    "train": raw_dataset["train"].select(indice_1),
    "test": raw_dataset["test"].select(indice_2)
}
raw_dataset = DatasetDict(dataset_dict)
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 100
    })
    test: Dataset({
        features: ['prompt', 'prompt_id', 'messages', 'category'],
        num_rows: 100
    })
})

In [7]:
raw_dataset["train"][0]

{'prompt': 'Please summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. “We can’t ask restoration ecologists to plant nonnative species or to 

In [21]:
from transformers import AutoTokenizer
from huggingface_hub import login
model_id = 'mistralai/Mistral-7B-Instruct-v0.1'

In [22]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.model_max_length = 2048
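
Reusing the EOS token as the pad token is a common workaround when a tokenizer has no dedicated pad token. One caveat: a collator that masks padding by token ID (for example `DataCollatorForLanguageModeling` with `mlm=False`, which sets labels equal to `pad_token_id` to -100) will then also mask the genuine end-of-sequence token. The sketch below illustrates the effect in plain Python. The helper `mask_padding` and the token IDs are made up for illustration and are not part of this notebook's pipeline.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_padding(input_ids, pad_token_id):
    """Hypothetical helper: replace every pad-token label with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 2  # illustrative ID; pad_token_id is set to the same value above

# Made-up sequence: BOS, three content tokens, one real EOS, two padding tokens.
sequence = [1, 10, 11, 12, EOS_ID, EOS_ID, EOS_ID]
labels = mask_padding(sequence, pad_token_id=EOS_ID)

# The real EOS (position 4) is masked together with the padding.
print(labels)
```

Whether this matters depends on the collator the trainer ends up using, so it is worth inspecting one collated batch before a long run.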

In [24]:
tokenizer.chat_template

"{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content'] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n        {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n    {%- endif %}\n    {%- if message['role'] == 'user' %}\n        {%- if loop.first and system_message is defined %}\n            {{- ' [INST] ' + system_message + '\\n\\n' + message['content'] + ' [/INST]' }}\n        {%- else %}\n            {{- ' [INST] ' + message['content'] + ' [/INST]' }}\n        {%- endif %}\n    {%- elif message['role'] == 'assistant' %}\n        {{- ' ' + message['content'] + eos_token}}\n    {%- else %}\n        {{- raise_exception('Only user and assistant roles are supported, with the exc

In [26]:
import re
import random
from multiprocessing import cpu_count


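def apply_chat_template(example, tokenizer):
    """Render the example's chat messages into one training string under "text"."""
    messages = example["messages"]

    # tokenize=False returns the formatted string rather than token IDs.
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)

    return example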
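
To make the formatting concrete, here is a minimal plain-Python sketch of the visible part of the chat template printed above. It is a hand-written approximation for illustration, not the tokenizer's own code: the function name `render_mistral_chat` is hypothetical, and the `<s>` / `</s>` defaults are assumed values for `bos_token` and `eos_token`.

```python
def render_mistral_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate the [INST] ... [/INST] layout produced by the chat template."""
    # An optional leading system message is folded into the first user turn.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        loop_messages = messages[1:]
    else:
        system_message = None
        loop_messages = messages

    parts = [bos_token]
    for index, message in enumerate(loop_messages):
        role = message["role"]
        # Turns must alternate user/assistant, starting with user.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("roles must alternate user/assistant/user/...")
        if role == "user":
            content = message["content"]
            if index == 0 and system_message is not None:
                content = system_message + "\n\n" + content
            parts.append(" [INST] " + content + " [/INST]")
        elif role == "assistant":
            parts.append(" " + message["content"] + eos_token)
        else:
            raise ValueError("only user and assistant roles are supported")
    return "".join(parts)


example_text = render_mistral_chat(
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
)
print(example_text)  # <s> [INST] Hi [/INST] Hello</s>
```

For real preprocessing, `tokenizer.apply_chat_template` remains the source of truth; the sketch only helps when reading the rendered strings.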


In [27]:
column_names = list(raw_dataset["train"].features)
column_names

['prompt', 'prompt_id', 'messages', 'category']

In [29]:
raw_dataset = raw_dataset.map(apply_chat_template,
                              num_proc=cpu_count(),
                              fn_kwargs={"tokenizer": tokenizer},
                              remove_columns=column_names,
                              desc="Apply chat template"
)
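
As a rough mental model of the `map` call above: the function receives each row together with the extra `fn_kwargs`, and the columns listed in `remove_columns` are dropped from the result, leaving only the new `text` column. The sketch below is a conceptual plain-Python analogue, not how `datasets` implements it (no batching, caching, or multiprocessing). `map_examples` and `add_text` are hypothetical names used only here.

```python
def map_examples(rows, function, fn_kwargs=None, remove_columns=()):
    """Conceptual analogue of Dataset.map for a list of row dictionaries."""
    fn_kwargs = fn_kwargs or {}
    mapped = []
    for row in rows:
        # The function sees the original columns plus any extra keyword arguments.
        result = function(dict(row), **fn_kwargs)
        # Removed columns are dropped only after the function has run.
        mapped.append({k: v for k, v in result.items() if k not in remove_columns})
    return mapped


def add_text(example, prefix):
    """Toy stand-in for apply_chat_template: builds a 'text' field."""
    example["text"] = prefix + example["messages"][0]["content"]
    return example


rows = [{"prompt": "Hi", "messages": [{"role": "user", "content": "Hi"}]}]
result = map_examples(
    rows,
    add_text,
    fn_kwargs={"prefix": "<s> "},
    remove_columns=["prompt", "messages"],
)
print(result)  # [{'text': '<s> Hi'}]
```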

In [34]:
train_dataset = raw_dataset["train"]
test_dataset = raw_dataset["test"]

from pprint import pprint

pprint(train_dataset[1]["text"])

('<s> [INST] Help write a letter of 100 -200 words to my future self for Kyra, '
 'reflecting on her goals and aspirations. [/INST] Dear Future Self,\n'
 '\n'
 "I hope you're happy and proud of what you've achieved. As I write this, I'm "
 "excited to think about our goals and how far you've come. One goal was to be "
 "a machine learning engineer. I hope you've worked hard and become skilled in "
 'this field. Keep learning and innovating. Traveling was important to us. I '
 "hope you've seen different places and enjoyed the beauty of our world. "
 'Remember the memories and lessons. Starting a family mattered to us. If you '
 'have kids, treasure every moment. Be patient, loving, and grateful for your '
 'family.\n'
 '\n'
 'Take care of yourself. Rest, reflect, and cherish the time you spend with '
 "loved ones. Remember your dreams and celebrate what you've achieved. Your "
 "determination brought you here. I'm excited to see the person you've become, "
 "the impact you've made, and

### Model Definition

In [35]:
from transformers import BitsAndBytesConfig
import torch

In [36]:
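bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the model weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # use the NormalFloat4 quantization data type
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to float16 for computation
)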
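
A back-of-the-envelope estimate shows why 4-bit loading is attractive here. The sketch assumes a round 7 billion parameters and counts weight storage only; quantization constants, activations, gradients, and optimizer state are ignored, so real usage will be higher. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is the main reason a 7B model can fit on a single consumer GPU for fine-tuning experiments.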