# Convert Fine-Tuned Gemma Model to GGUF

This notebook takes a fine-tuned Gemma model (a base model plus a LoRA adapter), merges the two, and converts the merged model to the GGUF format for efficient CPU-based inference with `llama.cpp`. It then saves the GGUF file to Google Drive and uploads it to a Hugging Face repository.

## Step 1: Setup and Environment

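First, we mount Google Drive to access the saved model adapter. Then we install the necessary libraries. `llama-cpp-python` is not required for the conversion itself, which is handled by the tools in the `llama.cpp` repository in Step 3, but it is useful for loading the resulting GGUF file from Python to test it.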

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install -q -U torch transformers bitsandbytes peft accelerate huggingface_hub
!pip install -q -U llama-cpp-python --no-cache-dir

## Step 2: Load, Merge, and Save the Fine-Tuned Model

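Here, we load the original `gemma-2b-it` base model, then load the LoRA adapter we trained and saved to Google Drive. We merge the adapter weights into the base model to create the full fine-tuned model. Finally, we save the merged model to a local directory in this Colab session so the conversion script can access it. Because `gemma-2b-it` is a gated model, you must have accepted its license on Hugging Face and be authenticated (for example with `huggingface_hub.login()`) before running this cell, otherwise the download of the base model will fail.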

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import os

# --- Configuration ---
base_model_id = "google/gemma-2b-it"
# This must match the output_dir from the training notebook
adapter_folder_name = "gemma-2b-it-rutooro-A100"
adapter_path = f"/content/drive/MyDrive/{adapter_folder_name}/final_adapter"

# Path for the temporarily saved merged model
merged_model_dir = "/content/merged_model"

# --- Load Base Model ---
print(f"Loading base model: {base_model_id}")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16, # Use the same dtype as training for consistency
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# --- Load and Merge LoRA Adapter ---
print(f"Loading adapter from: {adapter_path}")
# Load the PEFT model by combining the base model with the adapter
model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge the adapter weights into the base model
print("Merging adapter into the base model...")
model = model.merge_and_unload()
print("Merge complete.")

# --- Save Merged Model ---
print(f"Saving merged model to: {merged_model_dir}")
os.makedirs(merged_model_dir, exist_ok=True)
model.save_pretrained(merged_model_dir)
tokenizer.save_pretrained(merged_model_dir)
print("Merged model saved successfully.")
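
Before moving on to conversion, it can help to confirm that the merged model directory contains what the conversion script will look for. The following is a minimal sketch, not part of the original workflow: the helper name `missing_model_files` and the list of expected files are assumptions based on what `save_pretrained` typically writes.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list[str]:
    """Return a list of expected items that are absent from model_dir.

    Hypothetical helper: the expected file names are an assumption based on
    the usual output of `save_pretrained`, not an exhaustive specification.
    """
    root = Path(model_dir)
    missing = []

    # Model architecture and hyperparameters
    if not (root / "config.json").is_file():
        missing.append("config.json")

    # Weights may be stored as safetensors shards or legacy .bin files
    if not (list(root.glob("*.safetensors")) or list(root.glob("*.bin"))):
        missing.append("weights (*.safetensors or *.bin)")

    # Either the SentencePiece model or the fast tokenizer file
    if not any((root / name).is_file() for name in ("tokenizer.model", "tokenizer.json")):
        missing.append("tokenizer (tokenizer.model or tokenizer.json)")

    return missing
```

Calling `missing_model_files(merged_model_dir)` should return an empty list if the save step succeeded; any entries in the list point to a file that needs to be regenerated before Step 3.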

## Step 3: Convert to GGUF

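Now we use the Hugging Face conversion script from the `llama.cpp` repository (`convert_hf_to_gguf.py` in current versions, which replaced the older `convert.py`) to transform the saved model into a GGUF file. The conversion script only writes unquantized or lightly quantized output types such as `f16`, `bf16`, and `q8_0`, so producing a `Q4_K_M` file takes two steps: first export an `f16` GGUF, then quantize it with the `llama-quantize` tool. `Q4_K_M` is a 4-bit K-quant that provides a good balance between file size and output quality.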
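
To make the size trade-off concrete, here is a rough back-of-the-envelope sketch. The function name `estimate_gguf_size_gb` is hypothetical, and the figures used in the example (about 2.5 billion parameters for `gemma-2b-it`, about 4.8 bits per weight for `Q4_K_M`) are approximations assumed for illustration; real files differ because some tensors are kept at higher precision and the file also stores metadata.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in decimal gigabytes.

    Hypothetical helper: ignores metadata and per-tensor precision
    differences, so treat the result as a rough estimate only.
    """
    total_bits = n_params * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Assumed, approximate inputs (not measured values)
approx_params = 2.5e9
print(f"f16:    ~{estimate_gguf_size_gb(approx_params, 16.0):.1f} GB")
print(f"Q4_K_M: ~{estimate_gguf_size_gb(approx_params, 4.8):.1f} GB")
```

Under these assumptions the quantized file is roughly a third of the size of the `f16` export, which is the main reason for quantizing before distributing the model.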

In [None]:
# Clone the llama.cpp repository
!git clone https://github.com/ggerganov/llama.cpp.git

In [None]:
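# Define paths
gguf_model_name = "rutooro-gemma-2b-it.q4_k_m.gguf"
gguf_output_path = f"/content/{gguf_model_name}"
f16_gguf_path = "/content/rutooro-gemma-2b-it.f16.gguf"  # intermediate, unquantized file
# In current llama.cpp versions convert.py has been replaced by convert_hf_to_gguf.py
conversion_script_path = "/content/llama.cpp/convert_hf_to_gguf.py"

# 1) Convert the merged Hugging Face model to an f16 GGUF file.
#    The conversion script does not write K-quants such as Q4_K_M directly.
!python {conversion_script_path} {merged_model_dir} \
  --outfile {f16_gguf_path} \
  --outtype f16

# 2) Build the llama-quantize tool and quantize the f16 file to Q4_K_M.
#    LLAMA_CURL=OFF avoids a libcurl dependency in versions that require it by default.
!cmake -S /content/llama.cpp -B /content/llama.cpp/build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF
!cmake --build /content/llama.cpp/build --target llama-quantize -j
!/content/llama.cpp/build/bin/llama-quantize {f16_gguf_path} {gguf_output_path} Q4_K_M

print(f"GGUF model created at: {gguf_output_path}")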
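
The final `print` runs even if the shell commands above failed, so it is worth checking that a valid file was actually written. The sketch below reads the fixed-size GGUF header and reports its fields. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); the helper name `read_gguf_header` is hypothetical.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed header fields of a GGUF file.

    Hypothetical helper. Assumes the little-endian GGUF v2+ header layout:
    magic (4 bytes), version (uint32), tensor count (uint64),
    metadata key/value count (uint64).
    """
    with open(path, "rb") as f:
        header = f.read(24)

    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")

    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Running `read_gguf_header(gguf_output_path)` raises `ValueError` (or `FileNotFoundError`) if the conversion did not produce a usable file, and otherwise returns the header fields for a quick sanity check.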

## Step 4: Save GGUF to Google Drive

Now that the GGUF file is created, we'll copy it to your Google Drive for safekeeping.

In [None]:
# Define destination path in Google Drive
drive_gguf_path = f"/content/drive/MyDrive/{adapter_folder_name}/{gguf_model_name}"

# Copy the file
!cp "{gguf_output_path}" "{drive_gguf_path}"

print(f"GGUF file saved to your Google Drive at: {drive_gguf_path}")
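
Copies to a mounted Drive folder can occasionally be incomplete, for example if the session disconnects mid-write. One way to confirm the copy, sketched here with hypothetical helper names `sha256_of` and `copies_match`, is to compare file sizes and SHA-256 digests of the source and destination.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(src: str, dst: str) -> bool:
    """Return True if both files exist with identical size and content hash."""
    if not (os.path.isfile(src) and os.path.isfile(dst)):
        return False
    # Cheap size check first, then the full content hash
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    return sha256_of(src) == sha256_of(dst)
```

In this notebook the check would be `copies_match(gguf_output_path, drive_gguf_path)`; a `False` result suggests re-running the copy cell.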

## Step 5: Upload GGUF to Hugging Face Hub

Finally, we'll upload the GGUF model to your Hugging Face repository. This makes it easily accessible for others to download and use with tools like `llama.cpp`.

In [None]:
from huggingface_hub import HfApi, login

# --- Configuration ---
hf_repo_id = "cle-13/gemma-2b-it-rutooro-A100"

# --- Login to Hugging Face ---
# You will need a Hugging Face token with 'write' permissions
print("Please log in to Hugging Face...")
login()

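# --- Upload the File ---
print(f"Uploading {gguf_model_name} to {hf_repo_id}...")
api = HfApi()
# Create the repository if it does not exist yet (no-op if it already does)
api.create_repo(repo_id=hf_repo_id, repo_type="model", exist_ok=True)
api.upload_file(
    path_or_fileobj=gguf_output_path,
    path_in_repo=gguf_model_name,
    repo_id=hf_repo_id,
    repo_type="model"
)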

print("Upload complete!")
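
Once the upload finishes, others can fetch the file either through `huggingface_hub` or by direct URL. The sketch below shows both. The helper names `gguf_download_url` and `download_gguf` are hypothetical, and the download assumes the repository is public or that the caller is authenticated.

```python
def gguf_download_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hub model repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_gguf(repo_id: str, filename: str) -> str:
    """Download the file into the local Hub cache and return its local path."""
    # Imported lazily so the URL helper works without huggingface_hub installed
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Example with the names used in this notebook (no network access needed for the URL)
print(gguf_download_url("cle-13/gemma-2b-it-rutooro-A100", "rutooro-gemma-2b-it.q4_k_m.gguf"))
```

The path returned by `download_gguf` can be passed to any GGUF-compatible runtime, such as the `llama.cpp` command-line tools.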