# Export fine-tuned model, inference and comparison with base model

We will see how to push a trained model to the Hugging Face Hub, then quantize it and convert it to GGUF format, and finally use Ollama to run inference on the GGUF model. We will also compare the outputs of the base and fine-tuned models on the exact same input to see whether fine-tuning had any effect.

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Loading trained model

Place the zip file containing the LoRA weights in the same directory and unzip it (zipping makes the upload easier on Google Colab).

In [None]:
!unzip react_model

Archive:  react_model.zip
   creating: react_model/
  inflating: react_model/README.md   
  inflating: react_model/special_tokens_map.json  
  inflating: react_model/adapter_model.safetensors  
  inflating: react_model/tokenizer_config.json  
  inflating: react_model/tokenizer.json  
  inflating: react_model/adapter_config.json  


The next step downloads the base model and loads it together with the LoRA weights to build the complete model.

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "react_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.4: Fast Qwen2 patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.81G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.4 patched 28 layers with 28 QKV layers, 28 O layers and 0 MLP layers.


# Get inferences from trained model

In [None]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)

messages= [
  {
    "role": "system",
    "content": "You run in a loop of Thought, Action, Observation.\nAt the end of the loop, you output an Answer.\n\nUse Thought to describe your thoughts about the question you have been asked.\nUse Action to run one of the actions available to you - then wait for getting the observation of action in next message.\nObservation will be the result of running those actions.\n\nIf the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. Do not make arbitrary selections when multiple options are available—ask the user to choose.\n\nYour available actions are:\n```\ncheck_weather:\nusage - check_weather: <city: str>\ne.g. check_weather: New York\nReturns the current weather for the given city.\n\nset_reminder:\nusage - set_reminder: <time: str>, <task: str>\ne.g. set_reminder: 8:00 AM, Meeting with client\nSets a reminder for the specified time and task.\n```\n"
  },
  {
    "role": "user",
    "content": "Can you help be set a reminder?"
  }
]


inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)

<｜begin▁of▁sentence｜>You run in a loop of Thought, Action, Observation.
At the end of the loop, you output an Answer.

Use Thought to describe your thoughts about the question you have been asked.
Use Action to run one of the actions available to you - then wait for getting the observation of action in next message.
Observation will be the result of running those actions.

If the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. Do not make arbitrary selections when multiple options are available—ask the user to choose.

Your available actions are:
```
check_weather:
usage - check_weather: <city: str>
e.g. check_weather: New York
Returns the current weather for the given city.

set_reminder:
usage - set_reminder: <time: str>, <task: str>
e.g. set_reminder: 8:00 AM, Meeting with client
Sets a reminder for the specified time and task.
```
<｜User｜>Can you help be set a reminder?<｜Assistant｜>Thought: The user is

At this point we can use the trained model as shown above to run inference and build a system around it. But let's also see how to push the trained model so that others can use it.

We will also see how to convert the model to GGUF format (with quantization) and use it through Ollama.

I prefer pushing the merged model (base + LoRA) to the Hugging Face Hub and converting it from there, since at the time of writing Unsloth had issues saving GGUF directly.

In [None]:
# Add your model id and token accordingly

model.push_to_hub_merged("yashagrawal/react_deepseek_1.5B", tokenizer, save_method = "merged_16bit", token = "")

Unsloth: You are pushing to hub, but you passed your HF username = yashagrawal.
We shall truncate yashagrawal/react_deepseek_1.5B to react_deepseek_1.5B
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.8G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.54 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 63.16it/s]


Unsloth: Saving tokenizer...

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

 Done.
Unsloth: Saving react_deepseek_1.5B/pytorch_model.bin...


README.md:   0%|          | 0.00/632 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/yashagrawal/react_deepseek_1.5B
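
The push cell above leaves `token` as an empty string to be filled in by hand. One way to avoid pasting a token into the notebook is to read it from an environment variable. The sketch below assumes the token has been stored in `HF_TOKEN` beforehand; the helper name `get_hf_token` is hypothetical and not part of Unsloth.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    Assumption: the token was exported before this cell runs.
    `get_hf_token` is an illustrative helper, not an Unsloth API.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a token with write access."
        )
    return token
```

It could then be passed as `token = get_hf_token()` in the `push_to_hub_merged` call.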


In [None]:
# Can also use below to save the complete model locally if needed
# model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)

# GGUF Conversion

Once the model is pushed to the Hugging Face Hub, you can follow the steps here to convert it to GGUF: https://github.com/ggerganov/llama.cpp/discussions/2948

Or, like me, you can use this space: https://huggingface.co/spaces/ggml-org/gguf-my-repo (it pushes the GGUF version of the model directly under your Hugging Face user).

With either approach you can choose the level of quantization you want. Since you have already trained the model, you might as well quantize it with several methods and see which one gives the best and fastest results for you.

Once the GGUF model is pushed to the Hub, you can use it through Ollama.
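
Ollama addresses GGUF repositories on the Hub with names of the form `hf.co/<user>/<repo>`, which is the form used later in this notebook. As a small sketch, the helper below builds such a name; `ollama_hf_tag` is a hypothetical helper, and the idea that the part after the colon can name a specific quantization in the repo is an assumption to verify against the Ollama documentation.

```python
from typing import Optional


def ollama_hf_tag(user: str, repo: str, quant: Optional[str] = None) -> str:
    """Build an Ollama model name for a GGUF repo hosted on the Hugging Face Hub.

    Assumption: when `quant` is given (e.g. "Q8_0"), it is used as the tag;
    otherwise the tag falls back to "latest", as in the cells of this notebook.
    """
    tag = quant if quant else "latest"
    return f"hf.co/{user}/{repo}:{tag}"


model_name = ollama_hf_tag("yashagrawal", "react_deepseek_1.5B-Q8_0-GGUF")
```

If you quantize with several methods, a helper like this keeps the model names consistent when you compare them.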

# Ollama inference and comparison between original and fine-tuned model output

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [None]:
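import subprocess
import time

# Start the Ollama server in the background
subprocess.Popen(["ollama", "serve"])
# Give the server a few seconds to start before sending requests
time.sleep(3)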
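
A fixed `time.sleep(3)` may be too short on a slow machine. As a hedged alternative, the sketch below polls the address printed by the installer (`127.0.0.1:11434`) until a TCP connection succeeds. `wait_for_port` is a hypothetical helper, not part of the `ollama` package.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 11434,
                  timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not listening yet; wait briefly and retry.
            time.sleep(0.5)
    return False
```

Calling `wait_for_port()` after `subprocess.Popen(["ollama", "serve"])` returns `True` once the port accepts connections and `False` if it never does within the timeout.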

In [None]:
!pip install ollama



# Inference from base model (without any fine-tuning)

In [None]:
import ollama
from ollama import chat
from ollama import ChatResponse

ollama.pull('deepseek-r1:1.5b')

response: ChatResponse = chat(model='deepseek-r1:1.5b',
messages=[
  {
    "role": "system",
    "content": "You run in a loop of Thought, Action, Observation.\nAt the end of the loop, you output an Answer.\n\nUse Thought to describe your thoughts about the question you have been asked.\nUse Action to run one of the actions available to you - then wait for getting the observation of action in next message.\nObservation will be the result of running those actions.\n\nIf the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. Do not make arbitrary selections when multiple options are available—ask the user to choose.\n\nYour available actions are:\n```\nfind_movie:\nusage - find_movie: <genre: str>\ne.g. find_movie: Comedy\nReturns a list of movies matching the given genre.\n\nbook_ticket:\nusage - book_ticket: <movie_id: str>, <time: str>, <seats: int>\ne.g. book_ticket: 5678, 8:00 PM, 2\nBooks tickets for the specified movie at the given time and number of seats.\n```\n\nExample session:\n```\nQuestion: I want to watch a movie. Thought: The user hasn't specified what genre they prefer. I need to ask for their preference. Answer: Sure! What genre of movie are you in the mood for?\n\nQuestion: I feel like watching a sci-fi movie. Thought: I should search for sci-fi movies. Action: find_movie: Sci-Fi\n\n<end message, action will be performed and observation will be sent in next message>\n```"
  },
  {
    "role": "user",
    "content": "I feel like watching a movie maybe."
  }
])

print(response.message.content)

<think>
Alright, the user is saying they feel like watching a movie maybe. Hmm, I need to figure out how to respond based on the available actions. They haven't specified a genre yet.

So, looking at the example, when someone searches for a genre, it returns movies matching that genre. So if they just want to find any movie without specifying a genre, I should probably ask them about their preference.

Wait, but what if they don't specify a genre? The system can't know unless they tell me. But since they mentioned maybe, perhaps the user is open to any genre. So in that case, the appropriate action would be to prompt for their favorite genre.

I think I should structure this by asking, "What genre of movie are you in the mood for?" Then let them choose. That way, it's flexible and covers their possibility.
</think>

Thought: The user feels like watching a movie but hasn't specified which one they prefer or what genre to look for.

Action: find_movie:

<end message, action will be perfo

# Inference from fine-tuned model with same input

In [None]:
import ollama
from ollama import chat
from ollama import ChatResponse

ollama.pull('hf.co/yashagrawal/react_deepseek_1.5B-Q8_0-GGUF:latest')

response: ChatResponse = chat(model='hf.co/yashagrawal/react_deepseek_1.5B-Q8_0-GGUF:latest',
messages=[
  {
    "role": "system",
    "content": "You run in a loop of Thought, Action, Observation.\nAt the end of the loop, you output an Answer.\n\nUse Thought to describe your thoughts about the question you have been asked.\nUse Action to run one of the actions available to you - then wait for getting the observation of action in next message.\nObservation will be the result of running those actions.\n\nIf the tool requires arguments and the user has not provided them, **ask the user for the missing details** instead of assuming values. Do not make arbitrary selections when multiple options are available—ask the user to choose.\n\nYour available actions are:\n```\nfind_movie:\nusage - find_movie: <genre: str>\ne.g. find_movie: Comedy\nReturns a list of movies matching the given genre.\n\nbook_ticket:\nusage - book_ticket: <movie_id: str>, <time: str>, <seats: int>\ne.g. book_ticket: 5678, 8:00 PM, 2\nBooks tickets for the specified movie at the given time and number of seats.\n```\n\nExample session:\n```\nQuestion: I want to watch a movie. Thought: The user hasn't specified what genre they prefer. I need to ask for their preference. Answer: Sure! What genre of movie are you in the mood for?\n\nQuestion: I feel like watching a sci-fi movie. Thought: I should search for sci-fi movies. Action: find_movie: Sci-Fi\n\n<end message, action will be performed and observation will be sent in next message>\n```"
  },
  {
    "role": "user",
    "content": "I feel like watching a movie maybe."
  }
])

print(response.message.content)

Thought: The user hasn't specified what genre they prefer. I need to ask for their preference. Answer: Sure! What genre of movie are you in the mood for?


Notice the difference between the two models' outputs for the exact same input. The base model emits thinking tokens (the `<think>` block), while the fine-tuned model emits none, even though we included the thinking part during fine-tuning. Maybe I placed it incorrectly in the training format; I am not sure what exactly the issue is.

Also notice that the fine-tuned model's output matches the expected format: it stops where it should, so we can parse the response and present it to the user. The base model, by contrast, emits `Action: find_movie:` with no genre instead of asking the user. The power of fine-tuning!! Hope to see some cool agents with tool use running this model locally. The fine-tuned model is still not expected to be perfect; it may need more examples in the dataset and more hyperparameter tuning during training to get even better.
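
Since the fine-tuned output follows the `Thought:` / `Action:` / `Answer:` format, it can be parsed with a few regular expressions. The sketch below is one possible parser, not the one used for training or evaluation; `parse_react_output` is a hypothetical name, and it assumes the labels appear literally and that any `<think>...</think>` block should be discarded.

```python
import re
from typing import Dict, Optional

# Optional reasoning block emitted by the base model.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# A label followed by its text, up to the next label or the end of the string.
_FIELD_RE = re.compile(
    r"(Thought|Action|Answer):\s*(.*?)(?=\s*(?:Thought|Action|Answer):|\Z)",
    re.DOTALL,
)


def parse_react_output(text: str) -> Dict[str, Optional[str]]:
    """Split a ReAct-style reply into thought, action and answer fields."""
    text = _THINK_RE.sub("", text)
    result: Dict[str, Optional[str]] = {"thought": None, "action": None, "answer": None}
    for label, value in _FIELD_RE.findall(text):
        key = label.lower()
        if result[key] is None:  # keep the first occurrence of each label
            result[key] = value.strip()
    return result


reply = (
    "Thought: The user hasn't specified what genre they prefer. "
    "I need to ask for their preference. "
    "Answer: Sure! What genre of movie are you in the mood for?"
)
parsed = parse_react_output(reply)
```

Here `parsed["answer"]` holds the text to show the user, while a non-empty `parsed["action"]` would be dispatched to the matching tool. A reply whose text itself contains one of the labels would be split incorrectly, so a production parser would need stricter rules.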