<a href="https://colab.research.google.com/github/levelup-apps/hn-enhancer/blob/main/scripts/python/colab-notebook-finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finetune AI summarization of HN comments - HN Companion
This notebook contains the code to finetune a model for the task of summarizing HN comments.

In the HN Companion browser extension, we use LLMs such as OpenAI models, Claude, or any other model to generate summaries with a large system prompt. Results varied by model, but GPT-4, o1-mini, Claude and DeepSeek have given the best results by far, both in meaningful summaries and in proper backlinks from the summary to the relevant threads. But these models are expensive, especially when run continuously (multiple times a day) to summarize HN posts. We believe that if we finetune a less expensive open model (DeepSeek distilled, Llama etc.) on good data generated by expensive frontier models, we will get better results from the less capable model.

**Our hypothesis** - Finetuning a generic lower-end model on AI summaries generated by more powerful and expensive models will improve the quality of the output generated by the lower-end model.

The concept is similar to **knowledge distillation**, where a smaller model learns from a more powerful one. We have access to high-quality training data from top models that demonstrate the desired behavior. We use that data to teach a specific, bounded task (HN post summarization) rather than general capabilities.

If our hypothesis holds, we have a solution that balances cost and performance. We can then generate summaries in advance (for the top 30 / 60 posts), cache them and serve them. This will greatly improve the user experience and performance of 'summarize' workflows.

**Potential benefits** - Cost reduction for inference, faster inference times, ability to run summaries proactively rather than on-demand, more consistent output format and quality.

### General Approach

**Building dataset**
* ensure diversity in HN posts (technical, business, general discussion)
* include different post lengths and styles
* capture edge cases in the dataset

**Select base model**
* select a base model that is capable enough to learn the task
* base model should handle technical vocabulary well
* must be able to process longer inputs effectively

**Train & Evaluate**
* start with a small dataset (500 - 1000 examples)
* try with different prompt formats, validate with a test set
* evaluate accuracy of backlinks, consistency
* preservation of technical vocab and HN vibe
* do human eval of sample set if needed
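
The backlink check listed above could be automated along these lines. This is only a sketch: the input format (`[1.2] (score: ...)` at the start of each comment) is taken from the dataset sample shown later in this notebook, but the assumption that summaries cite comments with the same bracketed ids, and the helper name `backlink_accuracy`, are hypothetical.

```python
import re

# Assumption: comment ids look like "[1]", "[1.2]", "[3.1.4]" at the start of a line
# in the input, and the summary cites the same ids in square brackets.
COMMENT_ID_AT_LINE_START = re.compile(r"^\[(\d+(?:\.\d+)*)\]", re.MULTILINE)
COMMENT_ID_ANYWHERE = re.compile(r"\[(\d+(?:\.\d+)*)\]")


def backlink_accuracy(input_comments: str, summary: str) -> float:
    """Return the fraction of ids cited in the summary that exist in the thread."""
    known_ids = set(COMMENT_ID_AT_LINE_START.findall(input_comments))
    cited_ids = COMMENT_ID_ANYWHERE.findall(summary)
    if not cited_ids:
        return 0.0
    valid = sum(1 for cited in cited_ids if cited in known_ids)
    return valid / len(cited_ids)


# Toy thread and summary, made up for illustration.
thread = (
    "[1] (score: 1000) <replies: 1> alice: AdamW is the default now.\n"
    "[1.1] (score: 900) <replies: 0> bob: Agreed, with a caveat.\n"
    "[2] (score: 800) <replies: 0> carol: SGD still has its place.\n"
)
summary = "Most agree on AdamW [1] [1.1], while one reply [4] disagrees."

# Two of the three cited ids exist in the thread; "[4]" does not.
print(backlink_accuracy(thread, summary))
```

A score below 1.0 flags summaries that cite comments which are not in the thread, which makes it a cheap first filter before any human evaluation.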

This notebook is a customized version of the Unsloth example that finetunes a model for chat conversation and exports it in GGUF format to run on Ollama.


## How to run this notebook
Run each cell in order, or select Runtime > Run all.

Finetuning happens in a few steps:

1. [Select the base model](#Select)
2. [Load the training dataset from HuggingFace](#Data)
3. [Train the model](#Train)
4. [Run the model through inference](#Inference)
5. [Export the finetuned model to Ollama](#Ollama)

### References
* Install Unsloth on your own computer - [Unsloth Github page](https://github.com/unslothai/unsloth#installation-instructions---conda).

### Prepare the environment

In [None]:
#@title Install dependencies
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git



<a name="Select"></a>
### Select the model
We will finetune first with Llama 3 and then with a DeepSeek distilled model.

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096 # Context length.
  # TODO: Increase to 8192 or higher to match the context window length of final model.
  # Refer to https://unsloth.ai/blog/long-context. An RTX 4090 can take up to 56K context length (input tokens)

dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Set to False for higher (1-2%) accuracy

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

# base_model_name = "unsloth/llama-3-8b-bnb-4bit"
# base_model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit"
# base_model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit"
base_model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"

base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

print(f"\nInitialized base model - name: {base_model.name_or_path}, max_seq_length: {max_seq_length}")

==((====))==  Unsloth 2025.2.9: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Initialized base model - name: unsloth/llama-3.2-3b-instruct-bnb-4bit, max_seq_length: 4096


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

LoRA = **Low Rank Adaptation**, the highly efficient finetuning technique where you make small targeted adjustments to the base model in order to customize it to do specific tasks.

You train only a small set of parameters (low-rank matrices) that are applied on top of the frozen base model. These matrices add a learned correction to the original weights so that the model's outputs match the desired behavior for the new task.
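
To make the idea concrete, here is a minimal pure-Python sketch of the usual LoRA formulation, where the effective weight is `W + (alpha / r) * (B @ A)`. The toy matrices and the helper names are made up for illustration; real libraries apply this per layer on tensors, and this is not Unsloth's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * (B @ A); the frozen W itself is not modified."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 3x3 frozen weight with a rank-1 adapter (values are made up).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_b = [[1.0], [2.0], [3.0]]   # shape (3, r) with r = 1
lora_a = [[0.5, 0.0, -0.5]]      # shape (r, 3)

w_eff = lora_effective_weight(w, lora_b, lora_a, alpha=1, r=1)
print(w_eff)

# Only B and A are trained: r * (in + out) = 6 values instead of the 9 in W.
print(len(lora_b) * len(lora_b[0]) + len(lora_a) * len(lora_a[0]))
```

The saving is negligible for a 3x3 matrix but large for real projection matrices, where `r` is tiny compared with the layer width.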

In [None]:
lora_model = FastLanguageModel.get_peft_model(
    base_model,
    r = 16, # Rank of the finetuning process. We can choose any number > 0. Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # Scaling factor for finetuning. A large number may cause overfitting.
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context (uses 30% less VRAM, fits 2x larger batch sizes!)
    random_state = 3407, # number to determine deterministic runs
    use_rslora = False,  # rslora: rank stabilized LoRA
    loftq_config = None, # LoftQ - Advanced feature, can improve accuracy somewhat, but can make memory usage explode at the start.
)
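
As a sanity check on how small the adapter is, the number of trainable LoRA parameters can be estimated by hand: each targeted projection adds `r * (in_features + out_features)` weights. The layer shapes below are assumptions about Llama-3.2-3B (they are not stated anywhere in this notebook), so treat the sketch as an estimate to compare against the trainer's own report.

```python
# Assumed Llama-3.2-3B shapes (not stated in this notebook).
HIDDEN = 3072          # model width
INTERMEDIATE = 8192    # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 head dim
NUM_LAYERS = 28
R = 16                 # same rank as in get_peft_model above

# (in_features, out_features) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, r, num_layers):
    """Sum r * (in + out) over all targeted modules and all layers."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, R, NUM_LAYERS)
print(f"{trainable:,}")
```

Under these assumptions the estimate agrees with the `Number of trainable parameters = 24,313,856` line printed in the training banner further down.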

<a name="Data"></a>
### Data preparation
We now use the HN comments dataset from HuggingFace, which is a collection of comments/summary pairs for many HN posts.

In [None]:
from datasets import load_dataset
import json
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')

# hf_dataset_id = "annjose/hn-comments"
hf_dataset_id = "annjose/hn-comments-small"

train_dataset = load_dataset(hf_dataset_id, split = "train", token = hf_token)
print(f"Train dataset size: {len(train_dataset)}")

val_dataset = load_dataset(hf_dataset_id, split = "validation", token = hf_token)
print(f"Validation dataset size: {len(val_dataset)}")

test_dataset = load_dataset(hf_dataset_id, split = "test", token = hf_token)
print(f"Test dataset size: {len(test_dataset)}")

# Define the post_ids to exclude (these two posts have > 120K character length)
# This is the list of post ids with the length of their final text (prompt + comments) and token length
#   Row 1: post_id: 42608155. Text length: 146182.	Token count: 38724
#   Row 2: post_id: 42607623. Text length: 124933.	Token count: 29953
#   Row 3: post_id: 42611536. Text length: 121611.	Token count: 34908
#   Row 4: post_id: 42609819. Text length:  12748.	Token count:  3168
#   Row 5: post_id: 42607794. Text length:  17506.	Token count:  4684
#   Row 6: post_id: 42609595. Text length:  31676.	Token count:  7694

# this exclusion works - training runs ok, but inference is not meaningful
# exclude_ids = ['42608155', '42607623', '42611536',]

# this exclusion did not work on an L4 GPU with max_seq_length 2048
# exclude_ids = ['42608155', '42607623']

# exclude_ids = ['42608155'] # this didn't work even with max_context_length 36K

# in the smaller dataset, exclude two posts with 3000+ token length
exclude_ids = [] # ['42889786', '42901616']

# Filter the dataset to keep only rows where post_id is not in exclude_ids
filtered_dataset = train_dataset.filter(lambda x: x['post_id'] not in exclude_ids)

print(f"\nFiltered out {len(exclude_ids)} items from training dataset")
print(f"Original dataset size: {len(train_dataset)}")
print(f"Filtered dataset size: {len(filtered_dataset)}")

# dataset_hn = train_dataset
dataset_hn = filtered_dataset
print(f"\nHN dataset: {dataset_hn}")
print(f"selected post_ids: {filtered_dataset['post_id']}")

# print("\nFirst data row...")
# print(json.dumps(dataset_hn[0], indent=2))

Train dataset size: 6
Validation dataset size: 2
Test dataset size: 1

Filtered out 0 items from training dataset
Original dataset size: 6
Filtered dataset size: 6

HN dataset: Dataset({
    features: ['post_id', 'input_comment', 'output_summary'],
    num_rows: 6
})
selected post_ids: ['42803774', '42931109', '42901616', '42684257', '42889786', '42681762']
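
The exclusion list in the cell above was tuned by hand. One alternative is to derive it from the token counts: anything longer than `max_seq_length` cannot fit in a single training sequence. The sketch below reuses the per-post token counts recorded in that cell's comments; the helper name `ids_over_budget` is hypothetical.

```python
MAX_SEQ_LENGTH = 4096  # same value as max_seq_length in this notebook

# (post_id -> token count) copied from the comments in the data preparation cell
token_counts = {
    "42608155": 38724,
    "42607623": 29953,
    "42611536": 34908,
    "42609819": 3168,
    "42607794": 4684,
    "42609595": 7694,
}


def ids_over_budget(counts, max_tokens):
    """Return the post ids whose formatted text exceeds the token budget."""
    return sorted(post_id for post_id, n in counts.items() if n > max_tokens)


over_budget_ids = ids_over_budget(token_counts, MAX_SEQ_LENGTH)
print(over_budget_ids)
print(f"{len(token_counts) - len(over_budget_ids)} of {len(token_counts)} posts fit")
```

With a 4096-token window only one of those six posts fits, which is consistent with the TODO about raising `max_seq_length`.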


### Data Formatting - Llama 3 Instruct


In [None]:
# RUN THIS - Format the dataset to match the base model's template and validate it.

llama_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>"""

prompt_template = llama_template

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

system_prompt = """You are an AI assistant specialized in analyzing and summarizing Hacker News discussions.
A discussion consists of threaded comments where each comment can have child comments (replies) nested underneath it,
forming interconnected conversation branches. Your task is to provide concise, meaningful summaries that capture the
essence of the discussion while prioritizing engaging and high quality content."""

user_prompt_prefix = "This is your input:\n The title of the post and comments are separated by dashed lines."

def format_prompts_func(examples):

    try:
        post_ids = examples["post_id"]
        input_comments = examples["input_comment"]
        summaries = examples["output_summary"]

        formatted_texts = []
        for post_id, comment, summary in zip(post_ids, input_comments, summaries):
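            # Validate inputs. Rows cannot be skipped here: a batched map that keeps the
            # original columns must return one output per input row, so fail loudly instead.
            if not all([comment, summary]):
                raise ValueError(f"Empty comment or summary for post_id {post_id}")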

            # Create the prompt by combining system prompt and user prompt (comment)
            user_prompt = f"{user_prompt_prefix}\n{comment}"

            # Apply the data (prompt and summary) to the template format
            formatted_text = prompt_template.format(
                SYSTEM=system_prompt,
                INPUT=user_prompt,
                OUTPUT=summary
            ) + EOS_TOKEN

            formatted_texts.append(formatted_text)

        return {"text": formatted_texts}
    except Exception as e:
        print(f"Error formatting prompts: {e}")
        raise

print(f"BEFORE formatting: Column Names: {dataset_hn.column_names}. Row count: {dataset_hn.num_rows}")

# Format the text in the dataset according to the base model's prompt template
dataset = dataset_hn.map(format_prompts_func, batched = True,)

print(f"AFTER formatting: Column Names: {dataset.column_names}. Row count: {dataset.num_rows}")

import json
print("\ndataset[0]...")
print(json.dumps(dataset[0], indent=2))

# print the length of 'text' property in the dataset
print("\nLength of 'text' property in the datasets:")
total_token_count = 0
longest_token_length = 0 # longest token

for idx, row in enumerate(dataset):
  text = row['text']
  tokens = tokenizer.encode(text)

  token_count = len(tokens)
  total_token_count += token_count
  longest_token_length = max(longest_token_length, token_count)

  print(f"Row {idx + 1}: post_id: {row['post_id']}. Text length: {len(text)}.\tToken count: {token_count}")


print(f"\nTotal tokens in dataset: {total_token_count}. Longest token length in dataset: {longest_token_length}. Model's max token length: {max_seq_length}")


BEFORE formatting: Column Names: ['post_id', 'input_comment', 'output_summary']. Row count: 6
AFTER formatting: Column Names: ['post_id', 'input_comment', 'output_summary', 'text']. Row count: 6

dataset[0]...
{
  "post_id": "42803774",
  "input_comment": "---- Post Title: \nAn overview of gradient descent optimization algorithms (2016)\n----- Comments: \n[1] (score: 1000) <replies: 3> janalsncm: Article is from 2016. It only mentions AdamW at the very end in passing. These days I rarely see much besides AdamW in production.Messing with optimizers is one of the ways to enter hyperparameter hell: it\u2019s like legacy code but on steroids because changing it only breaks your training code stochastically. Much better to stop worrying and love AdamW.\n[1.1] (score: 961) <replies: 0> nkurz: The mention of AdamW is brief, but in his defense he includes a link that gives a gloss of it: \"An updated overview of recent gradient descent algorithms\" [].\n[1.2] (score: 923) <replies: 1> pizza: L
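
To see exactly what one training example looks like, the template can be rendered on a tiny made-up row without loading a tokenizer. The sketch below copies the template from the formatting cell; the EOS value is an assumption (check it with `print(tokenizer.eos_token)`), and `render_example` is a hypothetical helper.

```python
LLAMA_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "{OUTPUT}<|eot_id|>"
)


def render_example(system, user, summary, eos_token):
    """Fill the template and append the EOS token, as the formatting cell does."""
    text = LLAMA_TEMPLATE.format(SYSTEM=system, INPUT=user, OUTPUT=summary)
    return text + eos_token


# Assumption: the tokenizer's EOS token is "<|eot_id|>" for this Instruct model.
ASSUMED_EOS = "<|eot_id|>"

rendered = render_example(
    system="You summarize HN threads.",
    user="---- Post Title: \nA toy post\n----- Comments: \n[1] alice: hello",
    summary="A short summary.",
    eos_token=ASSUMED_EOS,
)
print(rendered)

# If EOS really is "<|eot_id|>", every example ends with the token twice.
print(rendered.endswith("<|eot_id|><|eot_id|>"))
```

If that last check prints `True` with the real tokenizer, it may be worth dropping either the trailing `<|eot_id|>` in the template or the appended `EOS_TOKEN`, so the model sees a single end marker.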

<a name="Train"></a>
### Train the model
Use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, remove `max_steps` and set `num_train_epochs=1` instead.

In [None]:
#@title Create SFT trainer

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

import wandb

# Initialize Weights and Biases for reporting

# if there is a pending run, finish it before login again
if wandb.run is not None:
    wandb.finish()
    print("Finished previous run.")

wandb.login()

current_run_name = "llama_4096_6posts_below_4k_tokensize"
run = wandb.init(
    project = 'HN-Summarize FineTune DeepSeek-R1-Distill-Llama-8B using HN Comments Data',
    name = current_run_name,
)

# TODO: Review all these params, change these - refer https://docs.unsloth.ai/basics/tutorial-how-to-finetune-llama-3-and-use-in-ollama
# check if we want to increase gradient_accumulation_steps
# replace max_steps with num_train_epochs: 1 or 3. We normally suggest 1 to 3 passes, and no more, otherwise you will over-fit your finetune.
# learning_rate: adjust based on the observed output. The goal is to set parameters so that the training loss gets as close to 0.5 as possible!

# Create SFTTrainer with the base model, tokenizer and our formatted dataset
trainer = SFTTrainer(
    model = lora_model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,  # for full training runs, comment this and uncomment num_train_epochs
        # num_train_epochs = 1, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc,
        run_name = current_run_name,
    ),
)


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
4.994 GB of memory reserved.


In [None]:
#@title Start the training
print(f"\nLongest token length in dataset: {longest_token_length}. Model's max token length: {max_seq_length}\n\n")

trainer_stats = trainer.train()

# finish logging in Weights and Biases
wandb.finish()

# set parameters to make the training loss go as close to 0.5 as possible!


Longest token length in dataset: 3841. Model's max token length: 4096




==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 6 | Num Epochs = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss


| Metric | Value |
|---|---|
| train/epoch | ▁ |
| train/global_step | ▁ |

| Metric | Value |
|---|---|
| total_flos | 1.3589514342580224e+16 |
| train/epoch | 0.0 |
| train/global_step | 0.0 |
| train_loss | 138158.79822 |
| train_runtime | 1157.6194 |
| train_samples_per_second | 0.415 |
| train_steps_per_second | 0.052 |
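
The banner above reports `Total batch size = 8` and `Num Epochs = 60` for a dataset of only 6 examples. The arithmetic behind those numbers can be sketched as follows; this approximates the trainer's bookkeeping rather than reproducing its code, and `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, epochs needed)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # At least one optimizer step per epoch, even when an epoch has fewer
    # batches than gradient_accumulation_steps.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return effective_batch, steps_per_epoch, epochs


# Values from this run: 6 examples, batch size 2, accumulation 4, max_steps 60.
print(training_schedule(6, 2, 4, 60))
```

With `max_steps = 60` every example is revisited many times, far beyond the 1 to 3 passes suggested in the trainer cell's comments, so results from this small dataset are best read as a pipeline smoke test.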


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)

print(f"Time taken for training: {trainer_stats.metrics['train_runtime']} seconds, {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes.\n")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Time taken for training: 1157.6194 seconds, 19.29 minutes.

Peak reserved memory = 4.994 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 33.878 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! Unsloth makes inference natively 2x faster as well! You should use prompts which are similar to the ones you had finetuned on, otherwise you might get bad results!

In [None]:
# Load validation data from HF and run inference on the fine-tuned model
import os

# test_post_id = "42608436"
test_post_id = "42864221"  # from the small dataset

# Load test data from your HuggingFace validation dataset
dataset = load_dataset(hf_dataset_id, split="validation", token=hf_token)

test_data_row = dataset.filter(lambda x: x['post_id'] == test_post_id)
if len(test_data_row) == 0:
    print(f"No test data found for post_id: {test_post_id}")
    raise ValueError(f"Test data not found for post_id: {test_post_id}")
else:
  print(f"Test data found with post_id: {test_data_row[0]['post_id']}")

test_input_comment = test_data_row[0]['input_comment'] or ("Empty Comment 0000")

user_prompt = f"{user_prompt_prefix}\n{test_input_comment}"

# Load the fine-tuned model for inference
FastLanguageModel.for_inference(lora_model) # Enable native 2x faster inference

messages = [
      {"role": "system", "content": f"{system_prompt}"},
      {"role": "user", "content": f"{user_prompt}"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

# print("Tokenized input text ... (should be in this format: <｜begin▁of▁sentence｜><｜User｜>...user_prompt...<｜Assistant｜>")
print("Tokenized input text ... (should be in this format: |begin_of_text|><|start_header_id|>system<|end_header_id|>..system..prompt...<|start_header_id|>user<|end_header_id|>...user_prompt...")
decoded_input = tokenizer.batch_decode(input_ids)
print(decoded_input)

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)

print("\n\n...Inference: Generating summary using the fine-tuned model....")
print(f"\n ===== Output from the fine-tuned model ====")
_ = lora_model.generate(input_ids, streamer = text_streamer, max_new_tokens = 1024, pad_token_id = tokenizer.eos_token_id)

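# Run the same prompt through base_model for comparison.
# NOTE: get_peft_model() typically injects the LoRA layers into base_model in place, so this
# object may still carry the trained adapters; treat the comparison with caution.
FastLanguageModel.for_inference(base_model) # Enable native 2x faster inference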

print(f"\n\n...Inference: Generating summary using the base model {base_model_name} ....")
print(f"\n ===== Output from the Base model ====")
_ = base_model.generate(input_ids, streamer = text_streamer, max_new_tokens = 1024, pad_token_id = tokenizer.eos_token_id)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Test data found with post_id: 42864221
Tokenized input text ... (should be in this format: |begin_of_text|><|start_header_id|>system<|end_header_id|>..system..prompt...<|start_header_id|>user<|end_header_id|>...user_prompt...
['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Feb 2025\n\nYou are an AI assistant specialized in analyzing and summarizing Hacker News discussions.\nA discussion consists of threaded comments where each comment can have child comments (replies) nested underneath it,\nforming interconnected conversation branches. Your task is to provide concise, meaningful summaries that capture the\nessence of the discussion while prioritizing engaging and high quality content.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nThis is your input:\n The title of the post and comments are separated by dashed lines.\n---- Post Title: \nThe doctor who gave himself an ulcer and solved a medical mystery (2010)\n
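
The decoded input above opens its system block with a `Cutting Knowledge Date` / `Today Date` preamble, which the hand-written training template in the data formatting cell does not contain. A small sketch can make that difference visible; the strings are copied from this notebook, while `system_block` and `first_difference` are hypothetical helpers.

```python
SYSTEM_PROMPT = (
    "You are an AI assistant specialized in analyzing and summarizing "
    "Hacker News discussions."
)


def system_block(system_text):
    """Build the system part of a Llama 3 style prompt."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_text}<|eot_id|>"
    )


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# What the training template produces for the system block.
training_prefix = system_block(SYSTEM_PROMPT)

# What the decoded inference input shows: a date preamble before the prompt.
inference_prefix = system_block(
    "Cutting Knowledge Date: December 2023\nToday Date: 14 Feb 2025\n\n" + SYSTEM_PROMPT
)

idx = first_difference(training_prefix, inference_prefix)
print(idx, repr(inference_prefix[idx:idx + 40]))
```

Since prompts at inference should resemble the ones used in finetuning, one option is to build the training text with `tokenizer.apply_chat_template` as well, so both sides share the same preamble.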

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
lora_model.save_pretrained("hn_llama_lora_model") # Local saving
tokenizer.save_pretrained("hn_llama_lora_model")
# lora_model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('hn_llama_lora_model/tokenizer_config.json',
 'hn_llama_lora_model/special_tokens_map.json',
 'hn_llama_lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `load_saved_model` to `True`:

In [None]:
load_saved_model = False
#if False:
if load_saved_model:
    lora_model_name = "hn_llama_lora_model" # YOUR MODEL YOU USED FOR TRAINING
    print("loading saved model")
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = lora_model_name,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
else:
    model = lora_model # Otherwise reuse the finetuned model that is still in memory

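import os

# NOTE: load_test_data_from_file is not defined elsewhere in this notebook. This minimal
# version is an assumption: it returns the file's text, or None if the file is missing.
def load_test_data_from_file(path):
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return f.read()

filename = "test_201_comments.md"
test_data = load_test_data_from_file(filename) or "hello world"
# print(f"\nTest data:\n{test_data}\n")

messages = [                    # Change below!
    # f-string so that {test_data} is actually substituted into the prompt
    {"role": "user", "content": f"Summarize this text. \n Your input is: {test_data}"},
]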
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "hn_llama_lora_model", # The directory the LoRA adapters were saved to above
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("hn_llama_lora_model")

<a name="Ollama"></a>
### Ollama Support

[Unsloth](https://github.com/unslothai/unsloth) now allows you to automatically finetune and create a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md), and export to [Ollama](https://ollama.com/)! This makes finetuning much easier and provides a seamless workflow from `Unsloth` to `Ollama`!

Let's first install `Ollama`!

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Next, we shall save the model to GGUF / llama.cpp

We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

We also support saving to multiple GGUF options in a list fashion! This can speed things up by 10 minutes or more if you want multiple export formats!

In [None]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

We use `subprocess` to start `Ollama` up in a non-blocking fashion! On your own desktop, you can simply open up a new `terminal` and type `ollama serve`, but in Colab, we have to use this hack!

In [None]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3) # Wait for a few seconds for Ollama to load!
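
A fixed three-second sleep may not be enough on a slow runtime. A more defensive option is to poll until the server answers. The sketch below is an assumption-laden alternative, not part of the original workflow: `wait_until_ready` and `ollama_probe` are hypothetical helpers, and the address is the one printed by the installer above.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout_s=30.0, interval_s=0.5):
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and similar errors mean "not up yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def ollama_probe(url="http://127.0.0.1:11434"):
    """Return True if the local Ollama server answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.status == 200


# Demonstration with a fake probe that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0.01))
```

In the notebook itself, `wait_until_ready(ollama_probe)` could stand in for `time.sleep(3)` after the `subprocess.Popen` call.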

`Ollama` needs a `Modelfile`, which specifies the model's prompt format. Let's print Unsloth's auto generated one:

In [None]:
print(tokenizer._ollama_modelfile)

FROM {__FILE_LOCATION__}

TEMPLATE """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.{{ if .Prompt }}

### Instruction:
{{ .Prompt }}{{ end }}

### Response:
{{ .Response }}<|end_of_text|>"""

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_"
PARAMETER temperature 1.5
PARAMETER min_p 0.1


We now will create an `Ollama` model called `unsloth_model` using the `Modelfile` which we auto generated!

In [None]:
!ollama create unsloth_model -f ./model/Modelfile


And now we can do inference on it via `Ollama`!

You can also upload to `Ollama` and try the `Ollama` Desktop app by heading to https://www.ollama.com/

In [None]:
!curl http://localhost:11434/api/chat -d '{ \
    "model": "unsloth_model", \
    "messages": [ \
        { "role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8," } \
    ] \
    }'

{"model":"unsloth_model","created_at":"2024-10-01T06:47:04.241326628Z","message":{"role":"assistant","content":"The"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:04.465575479Z","message":{"role":"assistant","content":" next"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:04.760101468Z","message":{"role":"assistant","content":" number"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:05.051240606Z","message":{"role":"assistant","content":" in"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:05.376545126Z","message":{"role":"assistant","content":" the"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:05.515751946Z","message":{"role":"assistant","content":" Fibonacci"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:05.658721744Z","message":{"role":"assistant","content":" sequence"},"done":false}
{"model":"unsloth_model","created_at":"2024-10-01T06:47:
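
The chat endpoint streams one JSON object per line, each carrying a small piece of the reply, as the output above shows. A minimal sketch for stitching those pieces back together follows; the sample lines are shortened copies of the output above, and `join_streamed_content` is a hypothetical helper.

```python
import json


def join_streamed_content(ndjson_text):
    """Concatenate message.content from newline-delimited JSON chunks."""
    parts = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            # Skip a partial line, e.g. a chunk that was cut off mid-transfer.
            continue
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


sample = "\n".join([
    '{"model":"unsloth_model","message":{"role":"assistant","content":"The"},"done":false}',
    '{"model":"unsloth_model","message":{"role":"assistant","content":" next"},"done":false}',
    '{"model":"unsloth_model","message":{"role":"assistant","content":" number"},"done":false}',
    '{"model":"unsloth_model","created_at":"2024-10-01T06:47:',
])

print(join_streamed_content(sample))
```

The truncated last line is skipped rather than raising, which mirrors how the captured output above ends mid-record.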

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Try our [Ollama CSV notebook](https://colab.research.google.com/drive/1VYkncZMfGFkeCEgN2IzbZIKEDkyQuJAS?usp=sharing) to upload CSVs for finetuning!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://ollama.com/"><img src="https://raw.githubusercontent.com/unslothai/unsloth/nightly/images/ollama.png" height="44"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>