<a href="https://colab.research.google.com/github/m-newhauser/fine-tune-qwen3/blob/main/nb/Gemma3_(4B)_grow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

Read our **[Qwen3 Guide](https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [None]:
# %%capture
# import os
# if "COLAB_" not in "".join(os.environ.keys()):
#     !pip install unsloth
# else:
#     # Do this only in Colab notebooks! Otherwise use pip install unsloth
#     !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
#     !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
#     !pip install --no-deps unsloth

In [1]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth
!pip install evaluate

In [20]:
from google.colab import userdata

# Get HF token
HF_TOKEN = userdata.get('HF_TOKEN')

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

In [2]:
import pandas as pd
import numpy as np
import datasets

In [3]:
from unsloth import FastModel
import torch

# fourbit_models = [
#     # 4bit dynamic quants for superior accuracy and low memory use
#     "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
#     "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",

#     # Other popular models!
#     "unsloth/Llama-3.1-8B",
#     "unsloth/Llama-3.2-3B",
#     "unsloth/Llama-3.3-70B",
#     "unsloth/mistral-7b-instruct-v0.3",
#     "unsloth/Phi-4",
# ] # More models at https://huggingface.co/unsloth

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.1: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


We now add LoRA adapters so we only need to update a small number of parameters!

In [4]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.language_model.model` require gradients
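To make "a small number of parameters" concrete, here is a minimal sketch of the arithmetic for one linear layer. LoRA adds two low-rank matrices, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, and trains only those. The `2560 x 2560` layer size is a hypothetical value chosen for illustration; it is not read from the Gemma-3 config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 2560, 2560, 8

base = d_in * d_out                        # frozen weights in the original layer
added = lora_param_count(d_in, d_out, r)   # trainable LoRA weights
print(f"base={base:,} added={added:,} ratio={added / base:.4%}")
```

Doubling `r` doubles the added parameters, which is why a larger rank costs more memory and is more prone to overfitting on a small dataset.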


<a name="Data"></a>
### Data Prep
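We now use the `Gemma-3` format for conversation-style finetunes. The commented-out line below loads [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset, which is in ShareGPT style; this run instead builds the same style of `conversations` column from a local CSV of posts. Gemma-3 renders multi-turn conversations like below: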

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
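The rendering above can be approximated by a few lines of string formatting. This is a hand-written sketch for intuition only, not the tokenizer's actual chat template, which also handles cases such as system prompts and image content. The function name `render_gemma3` is hypothetical.

```python
def render_gemma3(messages):
    """Render role/content messages into the turn format shown above (simplified)."""
    text = "<bos>"
    for message in messages:
        # Gemma-3 uses "model" as the role name for assistant turns.
        role = "model" if message["role"] == "assistant" else message["role"]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    return text

example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(example))
```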

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [5]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [6]:
from datasets import Dataset, load_dataset
# dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

df = (pd
      .read_csv("qwen3_raw_posts.csv")
      .fillna("")
      .assign(conversations=lambda df: df.apply(
          lambda row: np.array([
              {"from": "human", "value": row['sentence_summary']},
              {"from": "gpt", "value": row['raw_text']}
          ], dtype=object),
          axis=1
      ))
      .assign(source="linkedin")
      .query("author == 'Victoria'")
      .query("sentence_summary != ''")
      .drop(columns=["author", "url", "context", "hook", "length", "instructions_are_ai", "sentence_summary", "raw_text"])
      .reset_index(drop=True)
)

dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['conversations', 'source'],
    num_rows: 35
})
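For readers who want to see the row-to-conversation mapping without the pandas method chain, here is a minimal standard-library sketch of the same idea. The column names `author`, `sentence_summary`, and `raw_text` come from the cell above; the two CSV rows are invented stand-ins, not data from `qwen3_raw_posts.csv`.

```python
import csv
import io

# Hypothetical two-row stand-in for the CSV; only the columns used below are included.
raw_csv = io.StringIO(
    "author,sentence_summary,raw_text\n"
    "Victoria,Summarize CAG in one line.,CAG swaps retrieval for a KV cache.\n"
    "Someone Else,Ignored summary.,Ignored post.\n"
)

records = []
for row in csv.DictReader(raw_csv):
    # Keep one author's posts and skip rows without a summary, as the pandas chain does.
    if row["author"] != "Victoria" or not row["sentence_summary"]:
        continue
    records.append({
        "conversations": [
            {"from": "human", "value": row["sentence_summary"]},  # prompt
            {"from": "gpt", "value": row["raw_text"]},            # target post
        ],
        "source": "linkedin",
    })

print(len(records), records[0]["conversations"][0]["value"])
```

Each record pairs the summary (as the human turn) with the original post (as the gpt turn), so the model learns to expand a summary into a full post.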

We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!

In [7]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/35 [00:00<?, ? examples/s]
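As a rough picture of what standardization means here, the sketch below converts ShareGPT-style turns (`from`/`value`) into `role`/`content` messages. It assumes the mapping `human -> user` and `gpt -> assistant`; it illustrates the idea and is not Unsloth's implementation. `ROLE_MAP` and `standardize` are hypothetical names.

```python
# Assumed role mapping from ShareGPT names to chat-template names.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def standardize(conversation):
    """Convert ShareGPT-style turns into role/content messages."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]

sharegpt = [
    {"from": "human", "value": "Summarize this post."},
    {"from": "gpt", "value": "Here is the summary."},
]
print(standardize(sharegpt))
```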

We now have to apply the `Gemma-3` chat template to the conversations and save the result to `text`.

In [8]:
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"])
    return { "text" : texts }
pass
dataset = dataset.map(apply_chat_template, batched = True)

Map:   0%|          | 0/35 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice that `Gemma-3` adds a `<bos>` by default!

In [9]:
dataset[0]["text"]

'<bos><start_of_turn>user\nStart with a provocative statement suggesting CAG as a replacement for RAG, followed by a questioning remark. Introduce Cache-Augmented Generation (CAG) as a new question-answer method utilizing long-context LLMs to generate a document cache instead of retrieval, citing a paper with a link. Explain Step 1 of CAG: using a long-context LLM to generate a KV matrix as a document cache, briefly defining Keys (K) and Values (V) in the context of attention. Explain Step 2: using the query and KV cache to generate a response without reprocessing documents. List the benefits of this approach in bullet points (Efficiency, Memory Optimization, Simplified Architecture). List the drawbacks in bullet points (Context Window size limit, Updatability). Summarize the paper\'s results as looking good but mention several caveats in bullet points (no comparison to hybrid/SOTA dense retrieval, small datasets, no discussion of preloading time). Conclude with a personal opinion that

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 30 steps to speed things up, but you can set `num_train_epochs=1` for a full run and turn off the step limit with `max_steps=None`.

In [10]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/35 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

In [11]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/35 [00:00<?, ? examples/s]
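To illustrate what training on responses only means, here is a minimal sketch of label masking on a toy token sequence. Tokens outside model turns get the label `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so they contribute nothing to the loss. The token ids are invented, and the helper names are hypothetical; this is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_sublist(sequence, pattern, start=0):
    """Return the first index >= start where pattern occurs in sequence, or -1."""
    for i in range(start, len(sequence) - len(pattern) + 1):
        if sequence[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_non_responses(tokens, instruction_part, response_part):
    """Return labels equal to tokens inside model turns and IGNORE_INDEX elsewhere."""
    labels = [IGNORE_INDEX] * len(tokens)
    position = 0
    while True:
        start = find_sublist(tokens, response_part, position)
        if start == -1:
            break
        begin = start + len(response_part)   # first token after the response marker
        end = find_sublist(tokens, instruction_part, begin)
        if end == -1:
            end = len(tokens)                # the last response runs to the end
        labels[begin:end] = tokens[begin:end]
        position = end
    return labels

# Toy ids: 0 = <bos>, 1 = user-turn marker, 2 = model-turn marker, 9 = <end_of_turn>.
tokens = [0, 1, 10, 11, 9, 2, 20, 21, 9, 1, 12, 9, 2, 22, 9]
labels = mask_non_responses(tokens, instruction_part=[1], response_part=[2])
print(labels)
```

Only the tokens that follow a model-turn marker keep their ids as labels; the prompt tokens and the markers themselves are masked.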

In [12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.57 GB of memory reserved.


Let's train the model! To resume a training run, call `trainer.train(resume_from_checkpoint = True)`.

In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 35 | Num Epochs = 8 | Total steps = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,901,248/4,000,000,000 (0.37% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,3.0314
2,3.0066
3,2.8945
4,2.8654
5,2.3319
6,2.3137
7,2.0376
8,2.2216
9,1.7183
10,1.7384
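The training banner reports 8 epochs for 30 steps on 35 examples, which can look surprising at first. The sketch below reproduces that number under one assumption: that the number of optimizer updates per epoch is the floor of micro-batches divided by accumulation steps. That assumption matches the log here but is not taken from the Trainer source.

```python
import math

num_examples = 35
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 30

# Micro-batches per pass over the data (the last batch is partial).
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Assumption: optimizer updates per epoch are floored, with a minimum of one.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

# Epochs needed to reach max_steps optimizer updates.
num_epochs = math.ceil(max_steps / updates_per_epoch)

effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(batches_per_epoch, updates_per_epoch, num_epochs, effective_batch_size)
```

With only 35 examples, every example is seen several times in 30 steps, so the falling training loss partly reflects memorization of this small dataset.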


In [14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

365.804 seconds used for training.
6.1 minutes used for training.
Peak reserved memory = 6.021 GB.
Peak reserved memory for training = 0.451 GB.
Peak reserved memory % of max memory = 40.845 %.
Peak reserved memory for training % of max memory = 3.059 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [15]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Write a LinkedIn post about insstruction fine-tuning.",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

["<bos><start_of_turn>user\nWrite a LinkedIn post about insstruction fine-tuning.<end_of_turn>\n<start_of_turn>model\nThe prompt is still very important, but instruction tuning is one of the most important improvements that can affect performance. Instruction tuning is a technique that trains models to follow instructions from the prompt rather than completing tasks on their own. When applying this training, you will likely see a lot of improvements in a model's performance on"]
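To show roughly what `temperature`, `top_k`, and `top_p` do, here is a simplified sketch of the filtering step on a toy list of logits. It applies temperature, keeps the `top_k` most likely tokens, then keeps the smallest prefix whose renormalized probability reaches `top_p`. Real libraries may order or renormalize these steps differently, so treat this as an illustration; the function name is hypothetical.

```python
import math

def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: probability} for tokens that survive top-k, then top-p."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Unnormalized softmax weights, shifted by the max for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]

    # Top-k: keep the k highest-weight token ids.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_total
        if cumulative >= top_p:
            break

    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

candidates = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(candidates)
```

The next token is then sampled from the surviving candidates, which is why lower `top_p` or `top_k` values give more conservative text.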

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [17]:
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Write a LinkedIn post about insstruction fine-tuning.",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 700, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

🧵👇

If you aren’t doing instruction fine-tuning, I’m not sure why 🤔

Instruction fine-tuning (IFT) has become a cornerstone of building applications that work as intended when they are not just predicting the next token. Instead, they are actually learning to execute a variety of tasks within the context of user instructions - this is often the difference between just following what it's been trained on vs truly understanding the reasoning behind it.

How does it work?

1️⃣ First, we train a model on large amounts of data - like text and code from internet datasets and private datasets.
2️⃣ Next, we carefully curate a set of prompt templates, each designed to elicit the desired type of response from the model. Each prompt template typically has a few key features:
   ✅ Clear instructions: These are the direct commands telling the model what to do.
   ✅ Inputs: This is the contextual information that the model needs to work with.
   ✅ Outputs: This specifies the desired format and style

In [24]:
blog_post_content = """
Exploring RAG and GraphRAG: Understanding when and how to use both

Retrieval Augmented Generation (RAG) is an effective way to get AI to extract information from the specific set of data you want it to work with. The idea is relatively simple - although generative LLMs are amazing at what they do, they don’t know everything. So if we want an LLM to generate a response based on specific information in our documents, we have to provide it with that information (context) first.
RAG is the solution to that problem, and has become pretty much ubiquitous for most knowledge base search systems we see out in the wild today. What more can you need? In this article we want to highlight that your data and what it looks like, as well as the valuable information in your data, may dictate what kind of RAG is most effective for it.
While in many cases the relevant context may be found in the content of our data, there are applications where additional information can help improve performance of a RAG application. Graph RAG, for example, allows for context to be retrieved based on relations between data points in our database. With the combination of vector search based RAG and Graph RAG in a hybrid RAG system, we can return results not only on their contextual meaning, but also based on the relationships within our data. To help you understand the difference between the two approaches, we’ve also created a recipe that you can run with Colab.

What is RAG & what is it good at
RAG stands for “Retrieval Augmented Generation”. Let’s zone in on the first word there: retrieval. The first step in getting an LLM to respond to something based on some specific context is to retrieve that relevant context in the first place.
What is Naive RAG?
Retrieving context can be done in many many ways, but by far the most common way is to do semantic search (vector search) over a given set of data. This brings us to the term “Naive RAG”, which is simply a basic question-answer system with vector search based retrieval. Within most RAG systems, the “R” (retrieval) is based on vector search. This allows us to use of the semantic meaning of a query and extract the most relevant data based on that meaning, using embedding models to encode both the user query and all the data we may have stored somewhere (vector databases like Weaviate that are designed to do just this).
Because of the fundamental nature of Naive RAG, it’s a great way of retrieving relevant context for any given query, which can then be used by an LLM to generate a response. Most datasets that include embeddings used for Naive RAG contain a list of “text” fields, and for each of them, we have an embedding:

An important thing to notice is that each entry is an independent entry. Each entry has meaning that can be represented by a vector (embedding). So, the only information Naive RAG has access to are the independent vectors for each entry. This way of representing data doesn’t represent any relationships between data points beyond the proximity of their meaning in vector space.
Take the example in our recipe. Here, we’ll be showcasing RAG over a dataset that includes (fake) contracts (such as partnerships, employment etc) that were signed between individuals and companies. For each contract, we have the contract_text, author and contract_type. We then go ahead and vectorize all of this information, where each contract has one vector representing its meaning.
When we ask a question about the data, it does a great job at fetching the most relevant contracts to the question we just asked.

Where Naive RAG is not enough
Now, in most cases the so-called relationships between data points may not be so relevant to any given search task. But with these contracts, you can probably already start to imagine that something that encodes relationships might be super valuable. For example, for each contract we retrieve, we know the author, but our retrieved context does not encode further information such as whether the person the author has signed a contract with has relationships to yet other authors. With that in mind, let’s move on to Graph RAG 👇
What is GraphRAG?
GraphRAG has recently become an umbrella term referring broadly to RAG approaches where the retrieval component specifically leverages knowledge graphs. Under this umbrella, numerous methods have emerged, each differing in how they utilize graph-based retrieval to enhance LLM responses (learn more here).
Among these, the GraphRAG implementation from Microsoft has risen as one of the most popular and widely-adopted approaches.

Microsoft's GraphRAG pipeline. Image from [Edge et al., 2024] licensed under CC BY 4.0.
Microsoft's GraphRAG (MS GraphRAG) enhances knowledge graph construction by leveraging an LLM in a two-stage process. In the initial stage, entities and relationships are extracted and summarized from source documents, laying the foundation for the knowledge graph, as depicted in the pipeline illustration above.
How GraphRAG Extends Naive RAG Capabilities
What sets MS GraphRAG apart from naive RAG is its ability to detect graph communities and generate domain-specific summaries for groups of closely related entities once the knowledge graph is constructed. This layered approach integrates fragmented information from various text sources into a cohesive and organized representation of entities, relationships, and communities.
The resulting entity- and community-level summaries can be used to provide relevant information in response to user queries within a RAG application. Additionally, the structured knowledge graph enables the application of multiple retrieval approaches, such as a combination of graph search and vector search together, enhancing the overall search and retrieval experience

Implementing GraphRAG with Neo4j
For this blog post, we've developed a streamlined Python project that encapsulates all the prompts to avoid overwhelming you with extensive code. While this implementation is a proof-of-concept rather than production-ready code, it provides a practical demonstration. You can easily initialize a Neo4j driver and pass it to this simplified Ms Graph RAG implementation to see the concepts in action.
Extracting Entities and Relations
We use the same dummy financial dataset as we used in the baseline RAG implementation. This dataset comprises 100 contracts involving various parties. For MS GraphRAG method, the most critical configuration decision involves specifying which entity types should be extracted and summarized, as this selection fundamentally shapes all downstream results. Given our focus on contracts, we prioritize the extraction of key entity categories including Person, Organization, and Location.
allowed_entities = ["Person", "Organization", "Location"]
await ms_graph.extract_nodes_and_rels(texts, allowed_entities)

After the results we should have the following results:

The purple node is the contract that contains its text and metadata, while the green nodes represent extracted entities. Each entity has a name and description, and they can have multiple relationships between each other, as shown in the above image.
Generating Community Summaries
When an entity is mentioned in multiple contracts, it will have multiple descriptions, as it gets one description per contract. Similarly, there can be multiple relationships between entities if they appear in multiple chunks. To consolidate the information, the implementation proceeds with entity and relationship summarization, where we use an LLM to generate concise summaries and resolve duplicates or redundant information.
await ms_graph.summarize_nodes_and_rels()

Results are:


The revised model now displays a single consolidated relationship between entities, containing summarized information from all input sources. Furthermore, each entity receives a comprehensive summary, which can be quite detailed, as evidenced by the extensive profile generated for Danny Williams.
In the final phase of the indexing process, we employ graph algorithms, specifically the Leiden algorithm, to identify communities within the network. These communities represent clusters of densely interconnected nodes that exhibit stronger connections among themselves than with the rest of the graph.

Communities are distinguished by entity color in this visualization. This illustrates how densely interconnected nodes naturally cluster to form communities.
The idea behind MS GraphRAG is to generate comprehensive high-level summaries that span multiple relationships and nodes. This provides a more holistic overview by synthesizing interconnected information into a cohesive picture.
await ms_graph.summarize_communities()

With a knowledge graph constructed, we can move onto the retrieval part.

Hybrid Local Graph & Vector Search
There are multiple effective methods for retrieving information from a knowledge graph. The Microsoft GraphRAG team demonstrates three distinct approaches:
1. Global search
2. Local search
3. DRIFT search
The local search approach generates responses by intelligently merging information from the AI-extracted knowledge graph with relevant text segments from the source documents. Local search is particularly effective for questions that require detailed understanding of specific entities or concepts documented in the corpus (e.g., "What therapeutic benefits does lavender oil provide?").
Local search is a retrieval and response generation method that works by finding the most relevant information in your document collection based on specific entities mentioned in a user's question. Here's how it works:
1. Entity Recognition: When a user asks a question, the system identifies key entities (people, places, concepts, etc.) that are semantically related to the query.
2. Knowledge Graph Navigation: These identified entities act as entry points into your knowledge graph, allowing the system to:
* Find connected entities (relationships)
* Extract relevant attributes and properties
* Pull in contextual information from community reports or other sources
After indexing our entities in Weaviate, we'll implement a retrieval pipeline that leverages both vector and graph databases. First, Weaviate's semantic search capabilities identify the most relevant entities based on the query's meaning. Then, we can use Neo4j's graph traversal capabilities to discover connected entities, relationships, and community structures, revealing both direct connections and broader contextual networks that might not be immediately apparent through vector search alone. This hybrid approach combines the semantic understanding of vector search with the relationship intelligence of graph databases for comprehensive information retrieval.
retriever = WeaviateNeo4jRetriever(driver=driver,
                                   client=client,
                                   collection="Entities",
                                   id_property_external="entity_id",
                                   id_property_neo4j="name",
                                   retrieval_query=retrieval_query
                                  )

First, we query the Weaviate vector database to identify relevant entities based on semantic similarity to the user's question. The retrieved entity IDs serve as linking points that we map to corresponding nodes within our Neo4j graph database.
Behind the scenes, the system then executes a Cypher query that traverses the knowledge graph, following relationships between entities and extracting contextually relevant information. The integration of both the semantic search capabilities of Weaviate and the relationship-oriented structure of Neo4j creates a retrieval system that understands both content and connections within your data. The retrieval query is:

This Cypher query traverses from the initial set of entities to their corresponding neighbors, communities, chunks, and more.
If we test on the same example about Weaviate, we get the following answer (Note: all of the data for this demo is generated 👍):
Weaviate is a corporation organized under the laws of both the State of
California and the State of Delaware. Its principal place of business is
primarily located in San Francisco, CA, with additional offices at 123
Innovation Drive, Tech City, CA, and 123 Tech Lane, Silicon Valley, CA.
The company is involved in a wide range of activities, including
consulting, software development, data analysis, cloud storage, technical
support, and project management services. Weaviate is actively engaged in
partnerships to develop innovative AI solutions and advanced data
processing technologies, contributing resources and expertise to these
collaborations.
....

Known Limitations of GraphRAG
MS GraphRAG offers more entity-centric indexing and retrieval compared to traditional RAG's chunk-based approach, providing richer entity and community descriptions. However, it faces challenges with static LLM-generated summaries that require periodic full reindexing to capture updates when new data comes in. This indexing pipeline can incur substantial token costs. In contrast, traditional RAG does not require any reindexing pipeline for summary generation when new data is added, allowing for more efficient updates. Additionally, scalability might become problematic with nodes having thousands of connections, and highly-connected generic entity types must be filtered to prevent skewed results. The comprehensive preprocessing required for summarization represents both a strength for detail and a limitation for maintaining current information.
Summary
While Naive RAG is a simple and effective starting point for retrieval-augmented generation—especially when your data is well-structured and self-contained—Graph RAG takes things a step further by understanding the relationships and context between entities. It’s particularly powerful when your data is rich in connections and interdependencies, like contracts, research papers, or organizational records. By combining both approaches in a hybrid system, you can leverage the best of semantic similarity and structural insights to deliver more nuanced, accurate, and insightful responses. Whether you're just getting started with RAG or looking to push the boundaries with GraphRAG, choosing the right strategy starts with understanding your data.

"""

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Here is some context to use for the following task:"}]},
    {"role": "assistant", "content": [{"type": "text", "text": blog_post_content}]},
    {"role": "user", "content": [{"type": "text", "text": "Now, write a LinkedIn post about RAG and GraphRAG. with a CTA to the blog post provided as context."}]},
]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors="pt").to("cuda"),
    max_new_tokens=400,
    temperature=1.0, top_p=0.95, top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

Retrieval Augmented Generation with GraphRAG: Extracting More Value From Your Data 🗺️📚
Retrieval Augmented Generation (RAG) is an effective way to get AI to extract information from the specific set of data you want it to work with. The idea is relatively simple - although generative LLMs are amazing at what they do, they don’t know everything. So if we want an LLM to generate a response based on specific information in our documents, we have to provide it with that information (context) first.
RAG is the solution to that problem, and has become pretty much ubiquitous for most knowledge base search systems we see out in the wild today. What more can you need? In this article we want to highlight that your data and what it looks like, as well as the valuable information in your data, may dictate what kind of RAG is most effective for it.
While in many cases the relevant context may be found in the content of our data, there are applications where additional information can help improve 
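The output above largely repeats the opening of the context. One variation to try, offered as an untested assumption rather than a known fix, is to pass the context and the task in a single user turn instead of placing the context in an assistant turn. The helper name `build_single_turn_messages` and the placeholder context string are hypothetical.

```python
def build_single_turn_messages(context: str, task: str):
    """Build one user message that holds both the context and the task."""
    prompt = (
        "Here is some context to use for the following task:\n\n"
        f"{context.strip()}\n\n{task}"
    )
    return [{"role": "user", "content": [{"type": "text", "text": prompt}]}]

single_turn = build_single_turn_messages(
    context="(blog post text goes here)",
    task="Now, write a LinkedIn post about RAG and GraphRAG with a CTA to the blog post provided as context.",
)
print(single_turn[0]["content"][0]["text"])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of the three-message list used above.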

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [21]:
model.save_pretrained("m-newhauser/gemma-3-grow")  # Local saving
tokenizer.save_pretrained("m-newhauser/gemma-3-grow")
model.push_to_hub("m-newhauser/gemma-3-grow", token = HF_TOKEN) # Online saving
tokenizer.push_to_hub("m-newhauser/gemma-3-grow", token = HF_TOKEN) # Online saving

Saved model to https://huggingface.co/m-newhauser/gemma-3-grow


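Because only the adapters are saved, the output folder stays small (the upload above shows a roughly 60 MB `adapter_model.safetensors`). As a quick sanity check, the sketch below summarizes such a folder. The helper name `summarize_adapter_dir` is hypothetical, and it assumes the usual PEFT layout of an `adapter_config.json` next to `adapter_model.safetensors`.

```python
import json
from pathlib import Path


def summarize_adapter_dir(path):
    """Summarize a LoRA adapter folder (hypothetical helper, assumes PEFT file names)."""
    folder = Path(path)
    config_file = folder / "adapter_config.json"
    weight_file = folder / "adapter_model.safetensors"

    summary = {
        "has_config": config_file.is_file(),
        "has_weights": weight_file.is_file(),
        "weights_mb": None,
        "base_model": None,
        "rank": None,
    }
    if summary["has_weights"]:
        # Size on disk in megabytes (10^6 bytes)
        summary["weights_mb"] = round(weight_file.stat().st_size / 1e6, 1)
    if summary["has_config"]:
        config = json.loads(config_file.read_text())
        # Keys written by PEFT; missing keys simply stay None
        summary["base_model"] = config.get("base_model_name_or_path")
        summary["rank"] = config.get("r")
    return summary
```

Calling `summarize_adapter_dir("m-newhauser/gemma-3-grow")` after the local save should report both files as present; if `has_weights` is `False`, the adapters were probably not written to that path.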

Now, if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "m-newhauser/gemma-3-grow", # The LoRA adapters saved above
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Okay, let's break down what Gemma-3 is. It's a fascinating development in the world of AI, and here's a comprehensive overview:

**1. What it is:**

* **A Family of Open-Weight Language Models:** Gemma-3 isn't just *one* model
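The `messages` lists in the cells above all follow the same shape: each turn has a `role` and a `content` list holding one `{"type": "text", ...}` part. A minimal sketch of a builder for that shape is shown below. The helper names `text_message` and `build_conversation` are hypothetical, and the role check only covers the two roles used in this notebook.

```python
def text_message(role, text):
    """Build one chat turn in the content-list format used in this notebook."""
    return {"role": role, "content": [{"type": "text", "text": text}]}


def build_conversation(turns):
    """Convert (role, text) pairs into a messages list for apply_chat_template."""
    allowed_roles = {"user", "assistant"}  # assumption: only roles used here
    messages = []
    for role, text in turns:
        if role not in allowed_roles:
            raise ValueError(f"Unexpected role: {role!r}")
        if not isinstance(text, str) or not text:
            raise ValueError("Each turn needs a non-empty text string")
        messages.append(text_message(role, text))
    return messages
```

For example, `build_conversation([("user", "What is Gemma-3?")])` produces the same `messages` value as the cell above, which can then be passed to `tokenizer.apply_chat_template`.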


### Saving to float16 for vLLM

We also support saving the merged model directly in `float16` for deployment! The cell below writes it to the folder `gemma-3-finetune`. Change `if False` to `if True` to let it run!

In [None]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3-finetune", tokenizer)

If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune", tokenizer,
        token = "hf_..."
    )
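Rather than pasting a token into the notebook, it can be read from the environment so it does not end up in a shared or committed copy. The sketch below is one way to do that. The helper name `read_hf_token` and the `HF_TOKEN` variable name are assumptions; use whatever name your environment or Colab secrets setup provides.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the Hugging Face token from an environment variable (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token
```

The upload cell could then pass `token = read_hf_token()` instead of the `"hf_..."` placeholder.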

### GGUF / llama.cpp Conversion
Unsloth now natively supports saving to `GGUF` for `llama.cpp` for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4-bit will come later!

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
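To help pick between the three supported types, here is a rough back-of-the-envelope size estimate. It assumes `Q8_0` costs about 8.5 bits per weight (blocks of 32 8-bit values plus one 16-bit scale) and that `F16` and `BF16` cost 16 bits per weight. It ignores file metadata and any tensors the converter keeps at a different precision, so treat the numbers as approximate. The function name `estimate_gguf_gb` and the round 4 billion parameter count are illustrative assumptions.

```python
# Approximate storage cost per weight for the types this notebook supports
BITS_PER_WEIGHT = {"Q8_0": 8.5, "F16": 16.0, "BF16": 16.0}


def estimate_gguf_gb(num_params, quantization_type):
    """Rough GGUF size in GB (10^9 bytes), weights only (hypothetical helper)."""
    key = quantization_type.upper()
    if key not in BITS_PER_WEIGHT:
        raise ValueError(
            f"Unsupported type {quantization_type!r}; expected one of {sorted(BITS_PER_WEIGHT)}"
        )
    total_bits = num_params * BITS_PER_WEIGHT[key]
    return total_bits / 8 / 1e9


# Illustrative parameter count, not the exact size of any specific model
for qtype in ("Q8_0", "F16", "BF16"):
    print(qtype, round(estimate_gguf_gb(4_000_000_000, qtype), 2), "GB")
```

Under these assumptions, `Q8_0` takes roughly half the disk space of `F16` or `BF16`.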

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
        token = "hf_...",
    )

Now, use the exported `gemma-3-finetune.gguf` file in llama.cpp or in a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).
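As a sketch of the llama.cpp route, the snippet below assembles a `llama-cli` command line that reuses the sampling settings recommended for Gemma-3 earlier in this notebook. It assumes llama.cpp is already built and `llama-cli` is on your `PATH`; the helper name `llama_cli_command` is hypothetical, and the file name must match whatever the export step actually wrote.

```python
import shlex


def llama_cli_command(gguf_path, prompt, max_tokens=64):
    """Build an argument list for llama.cpp's llama-cli (hypothetical helper)."""
    return [
        "llama-cli",
        "-m", str(gguf_path),      # path to the exported GGUF file
        "-p", prompt,              # prompt text
        "-n", str(max_tokens),     # number of tokens to generate
        # Sampling settings recommended for Gemma-3 in this notebook
        "--temp", "1.0",
        "--top-p", "0.95",
        "--top-k", "64",
    ]


command = llama_cli_command("gemma-3-finetune.gguf", "What is Gemma-3?")
print(shlex.join(command))  # copy-pasteable shell command
```

The returned list can also be passed to `subprocess.run(command, check=True)` to launch generation from Python.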

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

