# Phi3.5-mini-instruct-UA fine-tuning notebook


## Setup

In [1]:
! pip install ninja packaging
! pip install wandb
! pip install -qqq --upgrade bitsandbytes transformers peft accelerate datasets trl torch flash_attn huggingface_hub

In [None]:
from google.colab import userdata
import os

os.environ["HF_HUB_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["WANDB_API_KEY"] = userdata.get('WANDB_API_KEY')

In [None]:
from huggingface_hub import login
import os

login(token=os.getenv("HF_HUB_TOKEN"))

## Data preparation

The source datasets use different formats, so we prepare them and mix them into a single dataset.

In [4]:
from datasets import load_dataset, Dataset, concatenate_datasets, Features, Value
from transformers import AutoTokenizer

# Define all columns as strings
features = Features({
    'input': Value('string'),
    'output': Value('string'),
    'instruct': Value('string'),
    'dataset_type': Value('string'),
    'dataloader_name': Value('string')
})

In [5]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct", trust_remote_code=True)
tokenizer.padding_side = 'right'

dataset = load_dataset("ostapbodnar/ua-gec-pos-ner-golden", features=features)

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name'],
        num_rows: 213960
    })
    validation: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name'],
        num_rows: 53490
    })
    test: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name'],
        num_rows: 78990
    })
})

In [7]:
len(dataset["test"])

78990

In [8]:
from datasets import DatasetDict

sampled_dataset = DatasetDict({
    "train": dataset["train"].shuffle(seed=42).select(range(30000)),
    "test": dataset["test"].shuffle(seed=42).select(range(1500)),
})
sampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name'],
        num_rows: 1500
    })
})

In [9]:
def create_message_column(row):
    messages = []
    user = {
        "content": f"{row['instruct']}\n Input: {row['input']}",
        "role": "user"
    }
    messages.append(user)
    assistant = {
        "content": f"{row['output']}",
        "role": "assistant"
    }
    messages.append(assistant)
    return {"messages": messages}

def format_dataset_chatml(row):
    return {"text": tokenizer.apply_chat_template(row["messages"], add_generation_prompt=False, tokenize=False)}

In [10]:
dataset_chatml = sampled_dataset.map(create_message_column)
dataset_chatml = dataset_chatml.map(format_dataset_chatml)

In [11]:
dataset_chatml

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name', 'messages', 'text'],
        num_rows: 30000
    })
    test: Dataset({
        features: ['input', 'output', 'instruct', 'dataset_type', 'dataloader_name', 'messages', 'text'],
        num_rows: 1500
    })
})

In [12]:
print(dataset_chatml["train"][587]["text"])

<|user|>
Ідентифікуй жанр новини на основі тексту.
 Input: Заголовок: {Шахтар} – {Сілекс} ⇒ Дивитися онлайн текстову трансляцію ≺{26.01.2021}≻ {Футбол} на СПОРТ.UA, текст: У вівторок, 26-го січня, відбудеться товариський поєдинок, в якому донецький «Шахтар» зіграє з македонським «Сілексом». Матч пройде в Анталії, початок гри о 10:00. На турецькому зборі діючі чемпіони України провели вже два спаринги: з польським «Лехом» (1:1) і болгарським «Лудогорцем» (2:2). Матч проти «срібного» призера минулого розіграшу чемпіонату Словенії - «Марібора», був скасований через спалах коронавірусу у словенців. «Сілекс» в Туреччині також без перемог. Команда на зборі провела два поєдинки і в обох зазнала поразки. В одному з них - від ковалівського «Колоса» Sport.ua проведе текстову трансляцію матчу «Шахтар» - «Сілекс». За перебігом поєдинку можна слідкувати за цим посиланням.<|end|>
<|assistant|>
спорт<|end|>
<|endoftext|>
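
To make the template's effect concrete, the sketch below rebuilds the layout of the printed sample by hand. It is only an approximation inferred from that one sample: `render_phi3_chat` is a hypothetical helper, not part of `transformers`, and the example row is made up. The authoritative format is whatever `tokenizer.apply_chat_template` returns.

```python
def render_phi3_chat(messages):
    """Approximate the Phi-3.5 chat layout seen in the printed sample."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, then the content closed by <|end|>.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # With add_generation_prompt=False the sample ends with <|endoftext|>.
    parts.append("<|endoftext|>")
    return "".join(parts)


# Made-up row with the same columns the notebook uses.
row = {"instruct": "Identify the genre.", "input": "Some news text", "output": "sport"}
messages = [
    {"role": "user", "content": f"{row['instruct']}\n Input: {row['input']}"},
    {"role": "assistant", "content": row["output"]},
]
rendered = render_phi3_chat(messages)
print(rendered)
```

Training on the fully rendered string means the model sees both the user turn and the expected assistant turn in one sequence.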


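Before training, we can check how many examples would be truncated at a chosen maximum length (2048 tokens in this case). The two cells below are commented out, so the check was not run here; note that the trainer later uses `max_seq_length=4096`.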

In [13]:
# from scipy.stats import percentileofscore
# import multiprocessing

# def calculate_lengths(batch):
#     return {"conv_lengths": [len(tokenizer(text)["input_ids"]) for text in batch["text"]]}

# conv_lengths = dataset_chatml["train"].map(
#     calculate_lengths,
#     batched=True,
#     batch_size=1000,
#     num_proc=multiprocessing.cpu_count()
# )["conv_lengths"]

In [14]:
# chosen_length=2048

# percentile = percentileofscore(conv_lengths, chosen_length)
# print(percentile)
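
The same check can be sketched without `scipy`. The token counts below are made up for illustration; in the notebook they would come from `len(tokenizer(text)["input_ids"])` for each formatted example. `truncation_stats` is a hypothetical helper, and its percentage may differ slightly from `percentileofscore` when lengths tie with the cutoff.

```python
def truncation_stats(token_lengths, max_length):
    """Return (number of examples truncated, percentage that fits) for a cutoff."""
    truncated = sum(1 for n in token_lengths if n > max_length)
    fits_pct = 100.0 * (len(token_lengths) - truncated) / len(token_lengths)
    return truncated, fits_pct


# Hypothetical token counts, one per training example.
lengths = [120, 480, 950, 2048, 2100, 3900, 5200, 640]
truncated, fits_pct = truncation_stats(lengths, max_length=2048)
print(f"{truncated} of {len(lengths)} examples exceed 2048 tokens; {fits_pct:.1f}% fit")
```

If the share of truncated examples is large, raising the cutoff or filtering long examples are the usual options.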

## Load model


In [15]:
# WANDB configuration (optional)

import wandb
import os

os.environ["PROJECT"]="phi3.5-mini-ua-golden"

project_name = os.environ["PROJECT"]

wandb.init(project=project_name, name = project_name)

[34m[1mwandb[0m: Currently logged in as: [33mostapbodnar[0m ([33mostap-bodnar[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [16]:
from random import randrange

import torch
from datasets import load_dataset

from peft import LoraConfig, prepare_model_for_kbit_training, TaskType, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    set_seed,
    pipeline
)
from trl import SFTTrainer

In [17]:
model_name = "microsoft/Phi-3.5-mini-instruct"

In [18]:
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

print(attn_implementation)
print(compute_dtype)

flash_attention_2
torch.bfloat16


In [19]:
device_map = {"": 0}  # place the whole model on GPU 0

# 4-bit quantization settings; the compute dtype comes from `compute_dtype` above
use_4bit = True
bnb_4bit_quant_type = "nf4"
use_double_quant = True

# LoRA hyperparameters
lora_r = 16
lora_alpha = 16
lora_dropout = 0.05
target_modules = ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]

set_seed(1234)

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_double_quant,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map=device_map,
    attn_implementation=attn_implementation,
)

model = prepare_model_for_kbit_training(model)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
- modeling_phi3.py

Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

In [21]:
hf_adapter_repo="ostapbodnar/Phi3.5-mini-instruct-UA-adapter-qlora"

args = TrainingArguments(
        output_dir="./phi-3.5-mini-QLoRA",
        evaluation_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        hub_model_id=hf_adapter_repo,
        per_device_train_batch_size=3,
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=3,
        log_level="debug",
        save_strategy="steps",
        save_steps=1500,
        logging_steps=1000,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=1000,
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        report_to="wandb",
        seed=42,
)

peft_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        task_type=TaskType.CAUSAL_LM,
        target_modules=target_modules,
)
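
The training log reports 8,912,896 trainable parameters, and that figure can be checked by hand. The sketch below assumes that only `o_proj` and `down_proj` actually received adapters, which would be the case if the loaded Phi-3 implementation fuses the other projections into `qkv_proj` and `gate_up_proj` so the remaining names in `target_modules` match nothing. This is an inference from the parameter count, not something the notebook states. The layer sizes are taken from the `Phi3Config` printed in the logs.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


hidden_size = 3072        # "hidden_size" in the logged Phi3Config
intermediate_size = 8192  # "intermediate_size" in the logged Phi3Config
num_layers = 32           # "num_hidden_layers" in the logged Phi3Config
rank = 16                 # lora_r

# Assumption: only these two modules per layer matched target_modules.
per_layer = (
    lora_param_count(hidden_size, hidden_size, rank)          # o_proj
    + lora_param_count(intermediate_size, hidden_size, rank)  # down_proj
)
total = per_layer * num_layers
print(total)
```

If adapters on the attention and gate/up projections are wanted as well, the module names of the loaded model can be inspected with `model.named_modules()` before building the `LoraConfig`.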



In [22]:
torch.cuda.empty_cache()

In [23]:
sft_trainer = SFTTrainer(
        model=model,
        train_dataset=dataset_chatml['train'],
        eval_dataset=dataset_chatml['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=4096,
        tokenizer=tokenizer,
        args=args,
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Using auto half precision backend


In [24]:
sft_trainer.train()
sft_trainer.save_model()

Currently training with a batch size of: 3
***** Running training *****
  Num examples = 30,000
  Num Epochs = 3
  Instantaneous batch size per device = 3
  Total train batch size (w. parallel, distributed & accumulation) = 6
  Gradient Accumulation steps = 2
  Total optimization steps = 15,000
  Number of trainable parameters = 8,912,896
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


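| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 2.087 | 1.725216 |
| 2000 | 1.7907 | 1.638705 |
| 3000 | 1.7304 | 1.589013 |
| 4000 | 1.6781 | 1.559551 |
| 5000 | 1.6757 | 1.534471 |
| 6000 | 1.614 | 1.518221 |
| 7000 | 1.604 | 1.503053 |
| 8000 | 1.5955 | 1.491941 |
| 9000 | 1.5758 | 1.4817 |
| 10000 | 1.563 | 1.474045 |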
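
The step counts in the training log follow from the batch settings. A minimal sketch of the arithmetic, assuming a single GPU (consistent with `device_map = {"": 0}`):

```python
import math

num_examples = 30_000
per_device_batch_size = 3
gradient_accumulation_steps = 2
num_epochs = 3

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

With `eval_steps=1000`, that gives five evaluation points per epoch.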



***** Running Evaluation *****
  Num examples = 1500
  Batch size = 3
Saving model checkpoint to ./phi-3.5-mini-QLoRA/checkpoint-1500
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-mini-instruct/snapshots/ccf028fc8e1b3ab750a7c55b22792f57ba69f216/config.json
Model config Phi3Config {
  "_name_or_path": "Phi-3.5-mini-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3.5-mini-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3.5-mini-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "model_type": "phi3",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_he


Saving model checkpoint to ./phi-3.5-mini-QLoRA/checkpoint-13500

In [25]:
sft_trainer.push_to_hub(hf_adapter_repo)

Saving model checkpoint to ./phi-3.5-mini-QLoRA

CommitInfo(commit_url='https://huggingface.co/ostapbodnar/Phi3.5-mini-instruct-UA-adapter-qlora/commit/c5c338ab4cab6cfffd6b3a7d527d7567c0212c1b', commit_message='ostapbodnar/Phi3.5-mini-instruct-UA-adapter-qlora', commit_description='', oid='c5c338ab4cab6cfffd6b3a7d527d7567c0212c1b', pr_url=None, pr_revision=None, pr_num=None)

In [26]:
hf_model_repo = "ostapbodnar/Phi3.5-mini-instruct-UA-qlora"

In [27]:
model_name, hf_adapter_repo, compute_dtype

('microsoft/Phi-3.5-mini-instruct',
 'ostapbodnar/Phi3.5-mini-instruct-UA-adapter-qlora',
 torch.bfloat16)

In [28]:
peft_model_id = hf_adapter_repo
tr_model_id = model_name

model = AutoModelForCausalLM.from_pretrained(tr_model_id, trust_remote_code=True, torch_dtype=compute_dtype)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.merge_and_unload()
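
Conceptually, merging folds each low-rank update into the base weight it adapts, `W + (alpha / r) * B @ A`, so the adapter is no longer needed at inference time. The toy example below illustrates that formula in plain Python with made-up numbers; it is a sketch of the idea, not PEFT's implementation. In this notebook `lora_alpha` and `lora_r` are both 16, so the scaling factor would be 1.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scaling = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape: rank x in_features
B = [[2.0], [4.0]]  # shape: out_features x rank
merged = merge_lora(W, A, B, alpha=1, rank=1)
print(merged)
```

Note that the cell above reloads the base model in `compute_dtype` rather than in 4-bit before attaching the adapter, so the merge happens on unquantized weights.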

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-mini-instruct/snapshots/ccf028fc8e1b3ab750a7c55b22792f57ba69f216/config.json

All model checkpoint weights were used when initializing Phi3ForCausalLM.

All the weights of Phi3ForCausalLM were initialized from the model checkpoint at microsoft/Phi-3.5-mini-instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-mini-instruct/snapshots/ccf028fc8e1b3ab750a7c55b22792f57ba69f216/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32007,
    32001,
    32000
  ],
  "pad_token_id": 32000
}



In [None]:
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
tokenizer.padding_side = 'left'

In [30]:
merged_model_id = hf_model_repo
model.push_to_hub(merged_model_id)
tokenizer.push_to_hub(merged_model_id)

Configuration saved in /tmp/tmptt9dh18z/config.json
Configuration saved in /tmp/tmptt9dh18z/generation_config.json
The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at /tmp/tmptt9dh18z/model.safetensors.index.json.
Uploading the following files to ostapbodnar/Phi3.5-mini-instruct-UA-qlora: config.json,model-00002-of-00002.safetensors,generation_config.json,model.safetensors.index.json,model-00001-of-00002.safetensors,README.md


tokenizer config file saved in /tmp/tmpd4mik0kg/tokenizer_config.json
Special tokens file saved in /tmp/tmpd4mik0kg/special_tokens_map.json
Uploading the following files to ostapbodnar/Phi3.5-mini-instruct-UA-qlora: tokenizer.model,tokenizer_config.json,added_tokens.json,special_tokens_map.json,tokenizer.json,README.md


CommitInfo(commit_url='https://huggingface.co/ostapbodnar/Phi3.5-mini-instruct-UA-qlora/commit/d2e8fe143fae4b24652768dd1d65d0db93396f9f', commit_message='Upload tokenizer', commit_description='', oid='d2e8fe143fae4b24652768dd1d65d0db93396f9f', pr_url=None, pr_revision=None, pr_num=None)

In [33]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned (merged) model and its tokenizer
def load_model(model_name_or_path):
    # Tokenizers are not placed on a device, so no device_map is passed here
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,
        use_cache=False,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    return model, tokenizer

model, tokenizer = load_model(hf_model_repo)

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/tokenizer_config.json
Special tokens have 

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/config.json
Model config Phi3Config {
  "_name_or_path": "ostapbodnar/Phi3.5-mini-instruct-UA-qlora",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3.5-mini-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3.5-mini-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "model

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/model.safetensors.index.json


Instantiating Phi3ForCausalLM model under default dtype torch.bfloat16.
Detected flash_attn version: 2.6.3
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "pad_token_id": 32000,
  "use_cache": false
}

All model checkpoint weights were used when initializing Phi3ForCausalLM.

All the weights of Phi3ForCausalLM were initialized from the model checkpoint at ostapbodnar/Phi3.5-mini-instruct-UA-qlora.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.


loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--ostapbodnar--Phi3.5-mini-instruct-UA-qlora/snapshots/d2e8fe143fae4b24652768dd1d65d0db93396f9f/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32007,
    32001,
    32000
  ],
  "pad_token_id": 32000
}



In [34]:

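def run_inference(model, tokenizer, input_string, max_length=1024):
    """Generate a reply to one user message; returns the decoded prompt plus reply."""
    device = next(model.parameters()).device
    messages = [{"role": "user", "content": input_string}]
    # return_dict=True also returns the attention mask, which generate() cannot
    # infer on its own because the pad token equals the EOS token
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(device)
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=max_length)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded_output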

In [35]:
dataset_chatml['test'][0]

{'input': "Заголовок: Nokia розкрила графік виходу нових смартфонів з 5G, текст: Джерело: NokiaPowerUser.На початку лютого покажуть Nokia 1.4. Пристрій отримає процесор з чотирма ядрами, 1 гігабайт оперативної пам'яті, накопичувач місткістю 16 гігабайтів, 6,51-дюймовий дисплей формату HD+, селфі-камеру на 5 мегапікселів і подвійну основну камеру з датчиками на 8 і 2 мегапікселі. Новинка буде отримувати живлення від акумулятора ємністю 4000 мА·год.Наприкінці поточного або на початку наступного кварталу дебютує модель Nokia 6.3/6.4 5G з підтримкою зв'язку п'ятого покоління. Смартфону приписують наявність процесора Snapdragon 480, дисплея FHD+ розміром 6,4 дюйма по діагоналі, 16-мегапіксельної фронтальної камери і основної камери з 48-мегапіксельним основним датчиком і оптикою ZEISS. Об'єм оперативної пам'яті буде досягати 6 гігабайтів, ємність постійної пам'яті 128 гігабайтів. Живлення забезпечить батарея на 4500 мА·год. Nokia 5.4 та Nokia 7.3 / Фото Nokia Нарешті, в лютому або березні в

In [36]:
input_text = dataset_chatml['test'][0]
text = f"{input_text['instruct']}\n Input: {input_text['input']}"
print(text)
output = run_inference(model, tokenizer, text)
print(output)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


Ідентифікуй основні терміни, що зустрічаються в тексті.
 Input: Заголовок: Nokia розкрила графік виходу нових смартфонів з 5G, текст: Джерело: NokiaPowerUser.На початку лютого покажуть Nokia 1.4. Пристрій отримає процесор з чотирма ядрами, 1 гігабайт оперативної пам'яті, накопичувач місткістю 16 гігабайтів, 6,51-дюймовий дисплей формату HD+, селфі-камеру на 5 мегапікселів і подвійну основну камеру з датчиками на 8 і 2 мегапікселі. Новинка буде отримувати живлення від акумулятора ємністю 4000 мА·год.Наприкінці поточного або на початку наступного кварталу дебютує модель Nokia 6.3/6.4 5G з підтримкою зв'язку п'ятого покоління. Смартфону приписують наявність процесора Snapdragon 480, дисплея FHD+ розміром 6,4 дюйма по діагоналі, 16-мегапіксельної фронтальної камери і основної камери з 48-мегапіксельним основним датчиком і оптикою ZEISS. Об'єм оперативної пам'яті буде досягати 6 гігабайтів, ємність постійної пам'яті 128 гігабайтів. Живлення забезпечить батарея на 4500 мА·год. Nokia 5.4 та N

Finally, I made some manual updates to the model repo:
- copied some files from the original model to my model...
- modified `config.json` and `generation_config.json` to use the right token IDs for `eos_token_id`.
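
The second manual step could also be scripted. The sketch below rewrites `eos_token_id` in a JSON config file; `set_eos_token_ids` is a hypothetical helper, and the demo runs on a throwaway file rather than the real repo. The notebook does not say which IDs were used, so the list here is an assumption: it is simply the `eos_token_id` list that the logs show for the base model's generation config.

```python
import json
import tempfile
from pathlib import Path


def set_eos_token_ids(config_path, eos_token_ids):
    """Rewrite `eos_token_id` in a JSON config file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["eos_token_id"] = eos_token_ids
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a temporary file that mimics a generation config with a single EOS id.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "generation_config.json"
    demo_path.write_text(
        json.dumps({"bos_token_id": 1, "eos_token_id": 32000, "pad_token_id": 32000})
    )
    # Assumption: target value copied from the base model's logged generation config.
    updated = set_eos_token_ids(demo_path, [32007, 32001, 32000])
    print(updated)
```

Listing every end-of-turn token matters for chat models, because generation only stops when one of the listed IDs is produced.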