# Qwen Fine Tuning Script
- Fine tune a QWEN model
- Show how to output model weights to GGUF Format so we can use it easily via Ollama (useful if we want to host these local models on the StockSensei deployment)

## Part 1: Model Finetuning (HuggingFace, MLX)

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
model_name = "unsloth/Phi-3-medium-4k-instruct"
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2024.11.10: Fast Mistral patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Parameters for finetuning

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing='unsloth',
    random_state=3407,
    use_rslora=False,
    loftq_config=None
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.11.10 patched 40 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


In [None]:
# Now the dataset
import pandas as pd
import json
import os

def ingest_train_data(directory: str):
    all_data = []
    for file in os.listdir(directory):
        path = os.path.join(directory, file)
        with open(path, 'r') as f:
            data = json.load(f)
            company_name = data['company']
            # drop keys that don't matter
            del_keys = [
                'state_location'
                'cik',
                'company',
                'filing_type',
                'filing_date',
                'period_of_report',
                'sic',
                'state_of_inc',
                'fiscal_year_end',
                'filing_html_index',
                'htm_filing_link',
                'complete_text_filing_link',
                'filename'
            ]
            all_data.extend([(data[text], text, company_name) for text in data.keys() if text not in del_keys])
    return all_data


def create_master_dataset(directory: str):
    df = pd.DataFrame(columns=["prompt", "completion"])
    data = ingest_train_data(directory=directory)
    for completion, item, company in data:
        prompt = f"What did {item} in {company}'s SEC 10-K filing say?"
        df.loc[len(df.index)] = [prompt, completion]
    return df

directory = './data/'
master_set = create_master_dataset(directory=directory)


In [None]:
!pip install datasets



In [None]:
# format dataset how unsloth wants
from unsloth import to_sharegpt
from datasets import Dataset

dataset = Dataset.from_pandas(master_set)

dataset = to_sharegpt(
    dataset,
    merged_prompt = "Your input is: {prompt}",
    output_column_name='completion',
)

from unsloth import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

Merging columns:   0%|          | 0/850 [00:00<?, ? examples/s]

Converting to ShareGPT:   0%|          | 0/850 [00:00<?, ? examples/s]

Standardizing format:   0%|          | 0/850 [00:00<?, ? examples/s]

In [None]:
# Now it's time to start training the model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        output_dir="outputs2",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        report_to='none'
    )
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map (num_proc=2):   0%|          | 0/850 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
# Train the model
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 850 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 60
 "-____-"     Number of trainable parameters = 65,536,000


Step,Training Loss
1,1.0653
2,1.3527
3,0.53
4,1.6391
5,1.4867
6,1.0787
7,1.3357
8,1.0647
9,0.5167
10,0.3725


In [None]:
# The Convergence seems acceptable at <0.8 loss, so go ahead and save the model
model.save_pretrained('bds_phi3_ft')
tokenizer.save_pretrained('bds_phi3_ft')

('bds_phi3_ft/tokenizer_config.json',
 'bds_phi3_ft/special_tokens_map.json',
 'bds_phi3_ft/tokenizer.model',
 'bds_phi3_ft/added_tokens.json',
 'bds_phi3_ft/tokenizer.json')

## Part 2: Saving to Ollama and Running Eval

In [None]:
# First we have to install ollama
%%capture
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
# Save this in 8B for now (faster)
model.save_pretrained_gguf("ft_model", tokenizer)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 7.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.64 out of 12.67 RAM for saving.


 18%|█▊        | 7/40 [00:01<00:04,  7.39it/s]We will save to Disk and not RAM now.
100%|██████████| 40/40 [19:30<00:00, 29.26s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving ft_model/pytorch_model-00001-of-00006.bin...
Unsloth: Saving ft_model/pytorch_model-00002-of-00006.bin...
Unsloth: Saving ft_model/pytorch_model-00003-of-00006.bin...
Unsloth: Saving ft_model/pytorch_model-00004-of-00006.bin...
Unsloth: Saving ft_model/pytorch_model-00005-of-00006.bin...
Unsloth: Saving ft_model/pytorch_model-00006-of-00006.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at ft_model into q8_0 GGUF format.
The output location will be /content/ft_model/unsloth.Q8_0.gguf
This will take 3 minutes...


Unsloth: Extending ft_model/tokenizer.model with added_tokens.json.
Originally tokenizer.model is of size (32000).
But we need to extend to sentencepiece vocab size (32011).
  self.stdout = io.open(c2pread, 'rb', bufsize)


INFO:hf-to-gguf:Loading model: ft_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00006.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> Q8_0, shape = {5120, 32064}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.float16 --> Q8_0, shape = {5120, 5120}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> Q8_0, shape = {5120, 1280}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.float16 --> Q8_0, shape = {5120, 1280}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> Q8_0, shape = {5120, 5120}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> Q8_0, shape = {5120, 17920}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> Q8_0, shape = {5120, 17920}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> Q

In [None]:
# Run Ollama in the background
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(5)

In [None]:
# Try to create the model file from the GGUF thing
with open('/content/ft_model/Modelfile', 'w') as f:
    f.write("FROM /content/ft_model/unsloth.Q8_0.gguf")

In [37]:
!ollama create ft_model_phi -f ./ft_model/Modelfile

[?25ltransferring model data ⠙ [?25h[?25l[2K[1Gtransferring model data ⠹ [?25h[?25l[2K[1Gtransferring model data ⠹ [?25h[?25l[2K[1Gtransferring model data ⠼ [?25h[?25l[2K[1Gtransferring model data ⠴ [?25h[?25l[2K[1Gtransferring model data ⠦ [?25h[?25l[2K[1Gtransferring model data ⠧ [?25h[?25l[2K[1Gtransferring model data ⠇ [?25h[?25l[2K[1Gtransferring model data ⠇ [?25h[?25l[2K[1Gtransferring model data ⠋ [?25h[?25l[2K[1Gtransferring model data ⠙ [?25h[?25l[2K[1Gtransferring model data ⠹ [?25h[?25l[2K[1Gtransferring model data ⠸ [?25h[?25l[2K[1Gtransferring model data ⠼ [?25h[?25l[2K[1Gtransferring model data ⠴ [?25h[?25l[2K[1Gtransferring model data ⠦ [?25h[?25l[2K[1Gtransferring model data ⠧ [?25h[?25l[2K[1Gtransferring model data ⠇ [?25h[?25l[2K[1Gtransferring model data ⠏ [?25h[?25l[2K[1Gtransferring model data ⠋ [?25h[?25l[2K[1Gtransferring model data ⠙ [?25h[?25l[2K[1Gtransferring model data ⠹ [

In [1]:
# Start running inference using eval dataset
%%capture
!pip install -U langchain-ollama

In [38]:
prompt = """
You are a financial advisor responsible for helping train an AI language model
to provide comprehensive, sound financial advice based on a company's financial
history. You are tasked with writing questions and ground-truth answers for the
task's benchmark dataset.

You will be provided a set of historical data on a given company. Given this data,
you should come up with a question that would effectively test an LLM's ability to
give coherent and correct information about a company. The LLM may also be asked to
give some subjective advice about a company's financial outlook. In these cases, while
there isn't necessarily a "correct" answer, any LLM answer should be supported clearly
by the provided data. The questions you create should have these goals in mind, and the
answers you generate should appropriately address the goals.

Format your output in the following format:

Do not include anything else in your response.

Here is an example of what your output could look like:

<<Example>>

What do AAPL's earnings reports say about it's growth potential?

Investors can be confident about AAPL's long-term growth potential. It has showed
consistent growth year-over-year, with revenue figures increasing by at least 2 percent
in every year.

Here is the user input:

{query}

Don't wrap the JSON output in anything (markdown, etc). Just return the JSON object itself.
"""

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from tqdm import tqdm
import pandas as pd

prompt = ChatPromptTemplate.from_template(prompt)

model = OllamaLLM(model="ft_model_phi")

chain = prompt | model

eval_dataset = pd.read_csv('eval_dataset.csv')

idno = 0
answers = []

for question in tqdm(eval_dataset["Question"]):
  output = chain.invoke({'query': question})
  answers.append({
      "no": idno,
      "question":question,
      "answer": output
  })
  idno += 1

final = pd.DataFrame(answers)
final.to_csv("phi-3-ft-output.csv", index=False)