# Ready for Google Colab - A100 (Requires high VRAM)



## Install packages

In [None]:
!pip install transformers -U

In [None]:
!pip install datasets

In [None]:
!pip install trl

In [None]:
!pip install bitsandbytes -U

In [22]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, pipeline
from datasets import load_dataset
from trl import SFTTrainer

## Prepare the Phi-3 model

In [9]:
model_id = "microsoft/Phi-3-mini-4k-instruct" # https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
dataset_name = "macadeliccc/opus_samantha" # https://huggingface.co/datasets/macadeliccc/opus_samantha
output_name = "custom_phi3"
max_seq_length = 2048
num_train_epochs = 1
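learning_rate = 1.41e-5  # not used: TrainingArguments in the Train section sets learning_rate=2e-5 directly
per_device_train_batch_size = 4  # not used: TrainingArguments in the Train section sets per_device_train_batch_size=1 directly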
gradient_accumulation_steps = 1

In [8]:
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
dataset = load_dataset(dataset_name, split="train")
EOS_TOKEN = tokenizer.eos_token  # the token text, not eos_token_id: it is appended to each training string

## Prepare the dataset

In [10]:
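def process_dataset(batch):
    """Flatten each conversation in a batch into a single training string."""
    conversations = batch["conversations"]
    texts = []
    # Header written before each turn, keyed by the dataset's speaker labels.
    mapper = {"system": "system\n", "human": "\nuser\n", "gpt": "\nassistant\n"}
    # Suffix written after each turn (empty for every role).
    end_mapper = {"system": "", "human": "", "gpt": ""}
    for data in conversations:
        text = "".join(
            f"{mapper[x['from']]} {x['value']}\n{end_mapper[x['from']]}" for x in data
        )
        # Mark the end of the conversation with the EOS token.
        texts.append(f"{text}{EOS_TOKEN}")
    return {"text": texts}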

In [11]:
dataset = dataset.map(process_dataset, batched=True)
print(dataset['text'][2])


user
 I'm really nervous about this upcoming presentation at work. Public speaking just isn't my thing, and I'm worried I'll mess it up. How can I calm my nerves and do well?

assistant
 First of all, it's totally normal to feel nervous about public speaking. Even the most seasoned speakers get butterflies sometimes. The key is to channel that nervous energy into excitement and preparation.

user
 That's a good point. I have been preparing a lot, practicing my slides and my talking points. But in the moment, I'm afraid my mind will just go blank.

assistant
 I totally get that fear. One thing that can help is to focus on your breathing. Take slow, deep breaths before you start and during your presentation if you need to center yourself. It's also helpful to remember that your audience is rooting for you. They want to see you succeed!

user
 You're right, they're not out to get me. I think I also put a lot of pressure on myself to be perfect. But realistically, a minor slip-up isn't th
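
The `conversations` column stores each turn as a `{"from": ..., "value": ...}` pair, with the speaker labels `system`, `human`, and `gpt` that `process_dataset` maps to plain-text headers. As an alternative to hand-written headers, the sketch below converts one conversation into the `role`/`content` message list that `tokenizer.apply_chat_template` accepts. The helper name `to_chat_messages` and the sample conversation are hypothetical, and whether the model's built-in chat template suits this dataset is an assumption to verify before training with it.

```python
# Hypothetical helper: convert ShareGPT-style turns into chat messages.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_chat_messages(conversation):
    """Map [{"from": ..., "value": ...}, ...] to [{"role": ..., "content": ...}, ...]."""
    messages = []
    for turn in conversation:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"Unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


# Made-up sample for illustration, not taken from the dataset.
sample = [
    {"from": "human", "value": "Hi!"},
    {"from": "gpt", "value": "Hello, how can I help?"},
]
messages = to_chat_messages(sample)
print(messages)

# With a loaded tokenizer, the list can then be rendered to a single string:
# text = tokenizer.apply_chat_template(messages, tokenize=False)
```

Raising on an unknown speaker label, rather than skipping the turn, keeps malformed rows from silently producing shortened training examples.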

## Train

In [12]:
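args = TrainingArguments(
    per_device_train_batch_size=1,  # hard-coded; the per_device_train_batch_size variable (4) is not used
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,  # trade extra compute for lower activation memory
    learning_rate=2e-5,  # hard-coded; the learning_rate variable (1.41e-5) is not used
    lr_scheduler_type="cosine",
    max_steps=-1,  # no step cap; train for num_train_epochs
    num_train_epochs=num_train_epochs,
    save_strategy="no",  # no intermediate checkpoints; the model is saved manually after training
    logging_steps=1,
    output_dir=output_name,
    optim="paged_adamw_32bit",  # paged AdamW from bitsandbytes
    bf16=True,  # bfloat16 mixed precision
)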

In [13]:
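trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",  # column created by dataset.map(process_dataset)
    max_seq_length=max_seq_length,
    # formatting_func is omitted: the dataset already carries the formatted "text" column
)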


In [14]:
# Run trainer
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1    | 2.5408        |
| 2    | 1.2462        |
| 3    | 0.9739        |
| 4    | 1.3971        |
| 5    | 1.2725        |
| 6    | 1.1779        |
| 7    | 1.2183        |
| 8    | 1.6436        |
| 9    | 1.1192        |
| 10   | 1.23          |
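
With `logging_steps=1` every step is logged, so the individual values are noisy. A trailing moving average can make the trend easier to read. This is a minimal sketch over the ten values listed above; the window size of 5 and the helper name `moving_average` are arbitrary choices for illustration.

```python
def moving_average(values, window):
    """Trailing mean over (up to) the `window` most recent values."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Training loss for steps 1-10, copied from the log above.
losses = [2.5408, 1.2462, 0.9739, 1.3971, 1.2725, 1.1779, 1.2183, 1.6436, 1.1192, 1.23]
smoothed = moving_average(losses, window=5)

for step, (raw, avg) in enumerate(zip(losses, smoothed), start=1):
    print(f"step {step:2d}  loss {raw:.4f}  avg {avg:.4f}")
```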


TrainOutput(global_step=1848, training_loss=1.0120637507552837, metrics={'train_runtime': 8455.9476, 'train_samples_per_second': 0.219, 'train_steps_per_second': 0.219, 'total_flos': 1.5080420437512192e+16, 'train_loss': 1.0120637507552837, 'epoch': 1.0})
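
The reported `global_step` and throughput follow from the dataset size and the batch settings. The check below is a rough sketch: it assumes a single GPU (the device count is not shown in the log), and the exact rounding when gradient accumulation is above 1 may differ between `Trainer` versions.

```python
import math

num_examples = 1848                  # size of the train split
per_device_train_batch_size = 1      # value passed to TrainingArguments
gradient_accumulation_steps = 1
num_devices = 1                      # assumption: one GPU
num_train_epochs = 1
train_runtime = 8455.9476            # seconds, from TrainOutput

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
steps_per_second = total_steps / train_runtime

print(total_steps)                     # 1848, matches global_step
print(round(steps_per_second, 3))      # 0.219, matches train_steps_per_second
print(round(train_runtime / 3600, 2))  # 2.35 hours
```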

In [25]:
model.save_pretrained("./custom_phi3")
tokenizer.save_pretrained("./custom_phi3")
# Optionally, upload the saved model to the Hugging Face Hub afterwards

('./custom_phi3/tokenizer_config.json',
 './custom_phi3/special_tokens_map.json',
 './custom_phi3/tokenizer.model',
 './custom_phi3/added_tokens.json',
 './custom_phi3/tokenizer.json')

## Inference

In [24]:
# A dataset entry related to facial recognition:
# [
#   { "from": "human", "value": "Samantha, do you think facial recognition technology is likely to be misused by those in power?" },
#   { "from": "gpt", "value": "Unfortunately, I think the potential for misuse is quite high. Throughout history, surveillance tools have often been abused by authorities to suppress dissent, target vulnerable groups, and expand their own power. Without strong regulations and oversight, facial recognition could become a dangerous weapon." },
#   { "from": "human", "value": "That's concerning. Are there any ways to mitigate those risks? Or is the technology inherently prone to abuse?" },
#   { "from": "gpt", "value": "I don't believe any technology is inherently good or bad - it depends on how we choose to use it. In theory, facial recognition could be implemented with strict limits on when and how it's used, clear transparency, and mechanisms for accountability. But getting those safeguards right is a huge challenge." },
#   { "from": "human", "value": "What role do you think private companies play here? Should they be selling facial recognition systems to governments at all?" },
#   { "from": "gpt", "value": "That's a great question. I believe companies have a responsibility to consider the human rights impact of their technologies. In my opinion, they should refuse to sell facial recognition tools to authorities who are likely to misuse them, such as repressive regimes. And any sales should come with strong conditions and restrictions attached." },
#   { "from": "human", "value": "It seems like an issue that requires action from policymakers, companies, and the public to get right. Like you said, we need a real societal debate. Thanks for sharing your perspective Samantha, you've given me a lot to think about!" },
#   { "from": "gpt", "value": "I couldn't agree more. This is a pivotal moment as this powerful technology emerges. It's up to all of us - policymakers, companies, and the public - to proactively shape how it will be used. Thanks for the thought-provoking conversation! I'm always here if you want to discuss this or any other topic further." }
# ]


prompt = "Hello, Samantha! What do you think about facial recognition technology, is it safe?"

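# To run inference in a fresh session, reload the saved model first:
# model = AutoModelForCausalLM.from_pretrained("./custom_phi3", trust_remote_code=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
# max_length counts the prompt tokens as well as the generated ones
outputs = pipe(prompt, max_length=100, num_return_sequences=1)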
print(outputs[0]["generated_text"])



Hello, Samantha! What do you think about facial recognition technology, is it safe?

assistant
 Hi there! Facial recognition is a fascinating but complex issue. I think it has the potential for both great benefits and significant risks. On one hand, it could enhance security and convenience in many settings. But on the other hand, there are serious privacy concerns and the potential for misuse.

user
 That's a good point.
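
The generated text runs on into a new `user` turn, most likely because the bare prompt does not follow the training layout and generation continues until `max_length` is reached. One possible remedy, sketched below, is to wrap the prompt in the same role headers that `process_dataset` uses and to cut the output at the next `user` header. Both helper names are hypothetical, the headers are assumed to match the `mapper` dictionary, and the model output here is a made-up stand-in.

```python
USER_HEADER = "\nuser\n"            # same strings as in `mapper`
ASSISTANT_HEADER = "\nassistant\n"


def build_prompt(user_message):
    """Format one user message like a training example, ending at the assistant header."""
    return f"{USER_HEADER} {user_message}\n{ASSISTANT_HEADER}"


def extract_reply(generated_text, prompt):
    """Drop the echoed prompt and anything from the next user header onward."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    return reply.split(USER_HEADER, 1)[0].strip()


prompt = build_prompt("Is facial recognition safe?")
# Stand-in for pipe(prompt, ...)[0]["generated_text"]; made up for illustration.
fake_output = prompt + " It has benefits and risks.\n\nuser\n That's a good point."
print(extract_reply(fake_output, prompt))  # It has benefits and risks.
```

Trimming after generation is only a post-processing step; the model still spends tokens on the extra turn, so a stopping criterion during generation would be the more economical option.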
