# Base models Qwen3-0.6B and Gemini API


This notebook is a **baseline evaluation** that assesses the performance of pre-trained models (`Qwen3-0.6B` and the `gemini-2.0-flash-exp` API) on the real estate price prediction task.

The notebook performs the following steps:

1. Load dataset from remote (Kaggle/Hugging Face).
2. Preprocess dataset.
3. Evaluate the **pre-trained base** `Qwen/Qwen3-0.6B` model using regression metrics.
4. Evaluate the Gemini **API** using regression metrics.
5. Evaluate the **fine-tuned** `Qwen-LoRA-Estate` model using regression metrics.
6. Save and compare results.

---
## Setup
---

### **Install Dependencies**

In [10]:
# !pip install -qU transformers wandb google-generativeai huggingface_hub[hf_xet]
!pip install -qU  json_repair

### **Import Dependencies**

In [None]:
import os
import json
import torch
import json_repair
import pandas as pd
from IPython.display import JSON
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer
import google.generativeai as genai

# my custom modules
from utils import logging_config, evaluate_model, timeit
from fine_tuning_helpers import apply_prompt_template, decode_response

In [22]:
os.environ['LOGS'] = '/kaggle/working/logs'
os.environ['RESULTS'] = '/kaggle/working/results'

os.makedirs(os.environ['LOGS'], exist_ok=True )
os.makedirs(os.environ['RESULTS'], exist_ok=True )

LOGGER = logging_config(log_dir=os.environ['LOGS'])
LOGGER

<_Logger utils (INFO)>

### **Define Tokens and Authenticate**

In [None]:
# If using kaggle 
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGINGFACEHUB_API_TOKEN")
gemini_token = user_secrets.get_secret("GOOGLE_API_KEY")

# uncomment if using colab
# from google.colab import userdata
# hf_token = userdata.get("HUGGINGFACEHUB_API_TOKEN")
# gemini_token = userdata.get("GOOGLE_API_KEY")

In [19]:
from huggingface_hub import whoami, login

# !huggingface-cli login --token {mytoken} # another method
login(token = hf_token)
genai.configure(api_key=gemini_token)
# JSON(whoami())

In [20]:
# uncomment if using colab
# import kagglehub
# kagglehub.login(validate_credentials=True)

---
## Load Dataset
---

### Read data from remote (Kaggle/Hugging Face)

**Download the dataset from Kaggle**

In [None]:
# # Uncomment if using colab

# data_path = kagglehub.dataset_download('hebamo7amed/real-estate-data-for-llm-fine-tuning')
# tabular_data_path = f"{data_path}/tabular_data"
# text_data_path = f"{data_path}/text_data"
# text_data_path

**Read Text Datasets**

In [None]:
# with open(f"{text_data_path}/text_train_data.jsonl", "r") as f:
#   train_data = json.load(f)

# with open(f"{text_data_path}/text_val_data.jsonl", "r") as f:
#   val_data = json.load(f)

# with open(f"{text_data_path}/sample_50.jsonl", "r") as f:
#   sample_data = json.load(f)

# print("Training data size = ", len(train_data))
# print("Validation data size = ", len(val_data))
# print("Sample data size = ", len(sample_data))

### **Load Dataset Sample from Hugging Face Hub**

This data sample was created from structured real estate data and uploaded to Hugging Face in the first notebook. It is formatted for instruction-based fine-tuning of an LLM.

In [7]:
from datasets import load_dataset
dataset = load_dataset(
    path  ='heba1998/real-estate-data-sample-for-llm-fine-tuning'
)
dataset

DatasetDict({
    train: Dataset({
        features: ['system', 'instruction', 'input', 'output', 'history'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['system', 'instruction', 'input', 'output', 'history'],
        num_rows: 200
    })
})
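
To make the record layout concrete, the sketch below assembles a prompt from the `system`, `instruction`, and `input` fields. It is only an illustration: the real `apply_prompt_template` lives in `fine_tuning_helpers.py` and is not shown in this notebook, so `build_prompt` and every value in `example` are hypothetical.

```python
# Hypothetical stand-in for `apply_prompt_template`; the real helper may
# use the tokenizer's chat template instead of plain string joining.
def build_prompt(sample: dict) -> str:
    """Join the non-empty system, instruction, and input fields into one prompt."""
    parts = [sample.get("system", ""), sample.get("instruction", ""), sample.get("input", "")]
    return "\n\n".join(part.strip() for part in parts if part and part.strip())


# Made-up record that only mirrors the dataset's feature names.
example = {
    "system": "You are a real estate pricing assistant.",
    "instruction": "Estimate the house price and answer in JSON.",
    "input": "3 bedrooms, 2 bathrooms, 1500 sqft",
    "output": '{"estimated_house_price": 300000}',
    "history": [],
}

prompt = build_prompt(example)
print(prompt)
```

The `output` field is deliberately left out of the prompt, since it holds the label the model is expected to produce.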

In [None]:
# Convert data to list of jsons or ``jsonl``
val_data = [sample for sample in dataset['validation']]
house_price = lambda sample : json_repair.loads(sample['output'])['estimated_house_price']
true_labels = [ house_price(sample) for sample in dataset['validation']]

---
## Evaluate Responses of `gemini-2.0-flash-exp` Model
---

Evaluate the responses from the Gemini API using regression metrics.

### **Helper Function to get responses from `genai` SDK API**

`batch_api_generate` relies on helpers from the `fine_tuning_helpers.py` (`apply_prompt_template`, `extract_house_price`) and `utils.py` (`timeit`) utility scripts.
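
The price-extraction step can be pictured with the sketch below. The real `extract_house_price` is not shown in this notebook, so `extract_price_sketch` is a hypothetical stand-in; it assumes the response contains a JSON object with an `estimated_house_price` key and returns `-1` when no price can be recovered, matching the sentinel used later in the results.

```python
import json
import re


def extract_price_sketch(text: str) -> int:
    """Return the `estimated_house_price` found in `text`, or -1 if none is found."""
    # Grab the span from the first "{" to the last "}", ignoring surrounding text
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            payload = json.loads(match.group(0))
            return int(float(payload["estimated_house_price"]))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            pass  # fall through to the sentinel
    return -1


print(extract_price_sketch('Here is the result: {"estimated_house_price": 345000} Hope it helps.'))
print(extract_price_sketch("I cannot estimate this property."))
```

Unlike `json_repair`, this sketch uses the strict standard-library parser, so malformed JSON is treated as a missing prediction rather than repaired.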

In [None]:
from fine_tuning_helpers import extract_house_price

@timeit
def batch_api_generate(model, data):
    llm_predictions = []
    tokens_history = []

    for idx, sample in enumerate(tqdm(data, total=len(data),
                                 desc="Get response from Gemini API",
                                 ncols=100, colour='green')):
        LOGGER.info("-"*50)
        LOGGER.info(f"Sample id {idx}")
        LOGGER.info("-"*50)
        # 1. PREPROCESSING:
        # build the system and user prompt with the prompt template
        prompt = apply_prompt_template(sample)

        # 2. GENERATION: Generate response
        response = model.generate_content(prompt)
        text_response = response.candidates[0].content.parts[0].text

        # 3. POSTPROCESSING: log the raw text, then extract the predicted price
        LOGGER.info(f"Response text for sample id={idx} \n {text_response}")

        # Clean response
        house_price = extract_house_price(text_response)
        LOGGER.info(f"\t>> Predicted price: {house_price}")
        llm_predictions.append(int(house_price))

        # Store token usage metadata for cost calculation
        tokens_history.append({
                'id': idx,
                'input_tokens': response.usage_metadata.prompt_token_count,
                'output_tokens': response.usage_metadata.candidates_token_count,
                'total_tokens': response.usage_metadata.total_token_count
            })

        LOGGER.debug(f"Updated tokens history for {idx}")

    return llm_predictions, tokens_history

In [27]:
list(model.name for model in genai.list_models())

['models/chat-bison-001',
 'models/text-bison-001',
 'models/embedding-gecko-001',
 'models/gemini-1.0-pro-vision-latest',
 'models/gemini-pro-vision',
 'models/gemini-1.5-pro-latest',
 'models/gemini-1.5-pro-001',
 'models/gemini-1.5-pro-002',
 'models/gemini-1.5-pro',
 'models/gemini-1.5-flash-latest',
 'models/gemini-1.5-flash-001',
 'models/gemini-1.5-flash-001-tuning',
 'models/gemini-1.5-flash',
 'models/gemini-1.5-flash-002',
 'models/gemini-1.5-flash-8b',
 'models/gemini-1.5-flash-8b-001',
 'models/gemini-1.5-flash-8b-latest',
 'models/gemini-1.5-flash-8b-exp-0827',
 'models/gemini-1.5-flash-8b-exp-0924',
 'models/gemini-2.5-pro-exp-03-25',
 'models/gemini-2.5-pro-preview-03-25',
 'models/gemini-2.5-flash-preview-04-17',
 'models/gemini-2.5-flash-preview-04-17-thinking',
 'models/gemini-2.5-pro-preview-05-06',
 'models/gemini-2.0-flash-exp',
 'models/gemini-2.0-flash',
 'models/gemini-2.0-flash-001',
 'models/gemini-2.0-flash-lite-001',
 'models/gemini-2.0-flash-lite',
 'models

### **Generate Response using Gemini**

Generate responses from `gemini-2.0-flash-exp` on the validation dataset.

In [52]:
# Cap the response length; the expected JSON answer is short
generation_config = genai.GenerationConfig(max_output_tokens=200)

model_id = "models/gemini-2.0-flash-exp"
gemini = genai.GenerativeModel(model_name=model_id,
                               generation_config=generation_config)

gemini_preds, gemini_tokens_history = batch_api_generate(gemini, data=val_data)

Data completed in 2.45 minutes.

### **Evaluation Metrics for `gemini-2.0-flash-exp` Model**

`evaluate_model` is implemented in the `utils.py` utility script.

In [53]:
print("Actual Price", true_labels[:10])
print("Predicted Prices", gemini_preds[:10])

Actual Price [2500000.0, 295000.0, 299900.0, 699000.0, 239000.0, 11000.0, 470000.0, 449000.0, 250000.0, 339000.0]
Predicted Prices [-1, -1, 345000, -1, -1, -1, -1, -1, -1, -1]


> **`-1`** indicates that Gemini did not produce a price prediction.
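
Because `-1` is a sentinel rather than a real price, it can distort error metrics if it is scored like any other prediction. The sketch below uses the ten values printed above to contrast the mean absolute error over all rows with the error over valid rows only. Whether `evaluate_model` already filters the sentinel is not visible in this notebook, so treat this as an illustration rather than a description of that function.

```python
actual = [2500000.0, 295000.0, 299900.0, 699000.0, 239000.0,
          11000.0, 470000.0, 449000.0, 250000.0, 339000.0]
predicted = [-1, -1, 345000, -1, -1, -1, -1, -1, -1, -1]

MISSING = -1  # sentinel for "no prediction"

# Keep only the rows where the model returned a price
valid_pairs = [(y, p) for y, p in zip(actual, predicted) if p != MISSING]

missing_pct = 100 * (len(predicted) - len(valid_pairs)) / len(predicted)
mae_valid = (sum(abs(y - p) for y, p in valid_pairs) / len(valid_pairs)
             if valid_pairs else float("nan"))
mae_all = sum(abs(y - p) for y, p in zip(actual, predicted)) / len(predicted)

print(missing_pct)  # 90.0
print(mae_valid)    # 45100.0
print(mae_all)      # much larger, dominated by the sentinel rows
```

Reporting the missing percentage next to the filtered metric keeps both pieces of information visible.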

**Evaluation Metrics and Predictions for `gemini-2.0-flash-exp` Model**

In [None]:
gemini_metrics = evaluate_model(true_labels, gemini_preds)

gemini_metrics["n_samples"] = len(val_data)
gemini_metrics["eval_time (min)"] = 2.45
gemini_metrics["response_time (min)"] = 2.45 / len(val_data)
gemini_metrics["eval_device"] = "remote-api"
gemini_metrics["model_name"] = model_id.split('/')[-1]

JSON(gemini_metrics)
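
The `2.45` above is typed in by hand from the timer output. One possible alternative, sketched below, is a decorator that returns the elapsed time alongside the result so it can be stored directly. `timeit_with_result` and `fake_batch` are hypothetical names; the actual `timeit` in `utils.py` is not shown here and may behave differently.

```python
import functools
import time


def timeit_with_result(func):
    """Hypothetical variant of `timeit` that also returns the elapsed minutes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_min = (time.perf_counter() - start) / 60
        print(f"Data completed in {elapsed_min:.2f} minutes.")
        return result, elapsed_min
    return wrapper


@timeit_with_result
def fake_batch(data):
    # Placeholder for batch_api_generate / batch_generate
    return [len(str(item)) for item in data]


preds, eval_minutes = fake_batch(["a", "bb", "ccc"])
metrics = {
    "n_samples": 3,
    "eval_time (min)": eval_minutes,
    "response_time (min)": eval_minutes / 3,
}
```

With this shape the metrics stay in sync with the run that produced them, even when the cell is re-executed.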

In [None]:
gemini_results = pd.DataFrame(gemini_tokens_history)
gemini_results['y_actual'] = true_labels
gemini_results['y_pred'] = gemini_preds

missed_prec = len(gemini_results[gemini_results['y_pred']==-1]) *100 /len(val_data)
gemini_metrics["missing_pred(%)"] = missed_prec

print(f">>>>>> Gemini can't predict {missed_prec}% from the given data <<<<<<")
gemini_results.head(5)

>>>>>> Gemini can't predict 72.0% from the given data <<<<<<


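|   | id | input_tokens | output_tokens | total_tokens | y_actual | y_pred |
|---|----|--------------|---------------|--------------|----------|--------|
| 0 | 0 | 257 | 78 | 335 | 2500000.0 | -1 |
| 1 | 1 | 258 | 77 | 335 | 295000.0 | -1 |
| 2 | 2 | 256 | 104 | 360 | 299900.0 | 345000 |
| 3 | 3 | 258 | 78 | 336 | 699000.0 | -1 |
| 4 | 4 | 258 | 78 | 336 | 239000.0 | -1 |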
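
The token counts are stored so that API cost can be estimated later. A minimal sketch of that calculation follows, using the five rows shown above. The per-token prices are placeholders chosen for illustration, not actual Gemini pricing, and `estimate_cost` is a hypothetical helper.

```python
# Token counts copied from the five rows shown above
tokens_history = [
    {"id": 0, "input_tokens": 257, "output_tokens": 78},
    {"id": 1, "input_tokens": 258, "output_tokens": 77},
    {"id": 2, "input_tokens": 256, "output_tokens": 104},
    {"id": 3, "input_tokens": 258, "output_tokens": 78},
    {"id": 4, "input_tokens": 258, "output_tokens": 78},
]

# Placeholder prices in USD per one million tokens (NOT real pricing)
PRICE_IN_PER_M = 0.10
PRICE_OUT_PER_M = 0.40


def estimate_cost(history, price_in_per_m, price_out_per_m):
    """Sum input and output tokens and convert them to a cost estimate."""
    total_in = sum(row["input_tokens"] for row in history)
    total_out = sum(row["output_tokens"] for row in history)
    cost = (total_in * price_in_per_m + total_out * price_out_per_m) / 1_000_000
    return total_in, total_out, cost


total_in, total_out, cost = estimate_cost(tokens_history, PRICE_IN_PER_M, PRICE_OUT_PER_M)
print(total_in, total_out)  # 1287 415
```

Input and output tokens are priced separately because providers commonly charge different rates for each.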


**Save `gemini-2.0-flash-exp` Results**

In [None]:
with open(f"{os.environ['RESULTS']}/gemini_metrics.json", 'w') as json_file:
    json.dump(gemini_metrics, json_file, indent=4)

gemini_results.to_csv(f"{os.environ['RESULTS']}/gemini_results.csv", index=False)

---
## Evaluate Responses from Pre-Trained Base LM `Qwen3-0.6B`
---

### **Helper Function to get responses for base model**

`batch_generate` relies on helpers from the `fine_tuning_helpers.py` (`apply_prompt_template`, `decode_response`) and `utils.py` (`timeit`) utility scripts.

In [29]:
@timeit
def batch_generate(model, tokenizer, data, device):
    predictions = []
    tokens_history = []

    for idx, sample in enumerate(tqdm(data, total=len(data),
                                 desc="Get response from pre-trained `Qwen3-0.6B` model",
                                 ncols=100, colour='green')):
        # 1. PREPROCESSING:
        # build the system and user prompt with the model chat template
        prompt = apply_prompt_template(sample)

        # 2. TOKENIZATION: Tokenize the text prompt message
        inputs = tokenizer([prompt], return_tensors="pt").to(device)
        n_input_tokens = len(inputs.input_ids[0])
        LOGGER.info(f"\t>> Tokenized to {n_input_tokens} tokens")

        # 3. GENERATION: Generate response
        response_tokens_ids = model.generate(
            inputs=inputs.input_ids,
            attention_mask=inputs.attention_mask,
        )
        # For a causal LM, generate() returns the prompt followed by the new tokens,
        # so subtract the prompt length to count only the generated tokens
        n_output_tokens = len(response_tokens_ids[0]) - n_input_tokens

        # 4. POSTPROCESSING: Keep only the output tokens (exclude input ids), then clean the response
        response_text = decode_response(response_tokens_ids, inputs.input_ids, tokenizer)
        response_dict = json_repair.loads(response_text)
        try:
            house_price = response_dict["estimated_house_price"]
        except (KeyError, TypeError, IndexError):
            try:
                # Fall back to a response that wraps the object in a list
                house_price = response_dict[0]["estimated_house_price"]
            except (KeyError, TypeError, IndexError):
                LOGGER.warning(f"No price found for sample id={idx}: {response_dict}")
                house_price = -1
        predictions.append(int(house_price))

        # Store base model metadata for cost calculation
        tokens_history.append({
                'id': idx,
                'input_tokens': n_input_tokens,
                'output_tokens': n_output_tokens,
                'total_tokens': n_input_tokens + n_output_tokens
            })

    return predictions, tokens_history
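
Step 4 keeps only the newly generated tokens. For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the prompt has to be sliced off before decoding. The sketch below shows that slicing on plain Python lists. `strip_prompt_tokens` is a hypothetical name, and the real `decode_response` in `fine_tuning_helpers.py` is not shown here; it presumably performs a similar slice and then decodes the remainder with the tokenizer.

```python
def strip_prompt_tokens(output_ids, input_ids):
    """Drop the echoed prompt from each generated sequence (plain-list illustration)."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Toy token ids: the first four output ids repeat the prompt
input_ids = [[101, 7, 8, 9]]
output_ids = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(output_ids, input_ids)
n_output_tokens = len(new_tokens[0])
print(new_tokens)       # [[42, 43, 2]]
print(n_output_tokens)  # 3
```

The same length difference gives the number of generated tokens, which is the value that belongs in `output_tokens`.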

### **Load `Qwen3-0.6B` from Hugging Face**

In [None]:
model_id = "Qwen/Qwen3-0.6B"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer_qwen = AutoTokenizer.from_pretrained(model_id)
model_qwen = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=None).to(device)

### **Generate Response using Pre-trained `Qwen3-0.6B`**

Generate responses from the base model `Qwen3-0.6B` on the validation dataset.

In [None]:
# Run batch prediction (text-to-number generation)
base_qwen_preds, base_qwen_tokens_history = batch_generate(model=model_qwen, 
                                                  tokenizer=tokenizer_qwen, 
                                                  data = val_data,
                                                  device = device)

> Prediction for 200 samples takes 2.82 minutes with the pre-trained `Qwen3-0.6B`.

### **Evaluation metrics for Pre-trained `Qwen3-0.6B`**

`evaluate_model` is implemented in the `utils.py` utility script.
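
For readers without access to `utils.py`, the sketch below shows what a regression evaluation of this kind typically computes. The choice of MAE, RMSE, and R² is an assumption, as is the name `regression_metrics`; the metrics actually returned by `evaluate_model` may differ.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


demo = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(demo)
```

RMSE penalises large misses more heavily than MAE, which matters for house prices that span several orders of magnitude.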

**Predictions sample**

In [None]:
print("Actual Price", true_labels[:10])
print("Predicted Prices", base_qwen_preds[:10])

Actual Price [2500000.0, 295000.0, 299900.0, 699000.0, 239000.0, 11000.0, 470000.0, 449000.0, 250000.0, 339000.0]
Predicted Prices [55000, 128000, 85000, 85000, 85000, 85000, 85000, 85000, 85000, 85000]


**Evaluation Metrics Results for Pre-trained `Qwen3-0.6B`**

In [None]:
base_qwen_metrics = evaluate_model(true_labels, base_qwen_preds)

base_qwen_metrics["n_samples"] = len(val_data)
base_qwen_metrics["eval_time (min)"] = 2.82
base_qwen_metrics["response_time (min)"] = 2.82 / len(val_data)
base_qwen_metrics["eval_device"] = "gpu-t4x2"
base_qwen_metrics["model_name"] = model_id.split('/')[-1]

JSON(base_qwen_metrics)

> The model produces poor results and does not follow the output schema.

> The model predicts almost the same value every time (`85000` or `8500`) because it is biased towards the example given in the schema.
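
The collapse onto a single value can be checked quickly by counting how often the most frequent prediction occurs. The sketch below does this for the ten predictions printed above.

```python
from collections import Counter

# First ten base-model predictions, copied from the output above
preds = [55000, 128000, 85000, 85000, 85000, 85000, 85000, 85000, 85000, 85000]

value, count = Counter(preds).most_common(1)[0]
share = count / len(preds)
print(value, count, share)  # 85000 8 0.8
```

A share this high on a regression task suggests the model is echoing a value from the prompt rather than reasoning about the listing.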

In [None]:
base_qwen_results = pd.DataFrame(base_qwen_tokens_history )
base_qwen_results['y_actual'] = true_labels
base_qwen_results['y_pred'] = base_qwen_preds

missed_prec = len(base_qwen_results[base_qwen_results['y_pred']==-1]) * 100 /len(val_data)
base_qwen_metrics["missing_pred(%)"] = missed_prec

print(f">>>>>> Base Qwen can't predict {missed_prec}% from the given data <<<<<<")
base_qwen_results 

**Save Pre-trained `Qwen3-0.6B` Results**

In [None]:
with open(f"{os.environ['RESULTS']}/pretrained_qwen_metrics.json", 'w') as json_file:
    json.dump(base_qwen_metrics, json_file, indent=4)

base_qwen_results.to_csv(f"{os.environ['RESULTS']}/pretrained_qwen_results.csv", index=False)

---
## Evaluate Responses from Fine-Tuned `Qwen3-0.6B`
---

### **Helper Function to get responses for the fine-tuned model**

The fine-tuned model reuses the `batch_generate` helper defined above for the base model, together with the same `fine_tuning_helpers.py` and `utils.py` utility scripts.

### **Load our Adapter `Qwen-LoRA-Estate` from Hugging Face**
Load the fine-tuned LoRA adapter from Hugging Face. The adapter was fine-tuned on the same dataset and output schema that the base model is evaluated on.

In [None]:
from peft import PeftModel

adaptor_id = "heba1998/Qwen-LoRA-Estate"

# Attach the LoRA adapter to the base model
peft_model = PeftModel.from_pretrained(model_qwen, adaptor_id)

# Merge the LoRA weights into the base weights and drop the adapter wrappers
qwen_lora_estate = peft_model.merge_and_unload()
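
Merging folds the low-rank update into the frozen weight so that inference needs no extra adapter layers. In the standard LoRA formulation the merged weight is `W + (alpha / r) * B @ A`. The sketch below works through that arithmetic on a tiny example; the matrices, rank, and scaling are illustrative values, not the configuration of `Qwen-LoRA-Estate`.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
B = [[0.5], [1.0]]            # LoRA "up" matrix (2x1)
A = [[2.0, -1.0]]             # LoRA "down" matrix (1x2)
r, alpha = 1, 2               # illustrative rank and scaling factor

scale = alpha / r
delta = matmul(B, A)          # low-rank update, same shape as W
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

print(delta)     # [[1.0, -0.5], [2.0, -1.0]]
print(W_merged)  # [[3.0, -1.0], [4.0, -1.0]]
```

After the merge only `W_merged` is kept, which is why the merged model has the same size and latency profile as the base model.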

### **Generate Response using Fine-Tuned `Qwen3-0.6B`**

Generate responses from the fine-tuned `Qwen3-0.6B` on the validation dataset.

In [14]:
# Run batch prediction (text-to-number generation)
qwen_lora_estate_preds, Qwen_lora_estate_tokens_history = batch_generate(model=qwen_lora_estate,
                                                            tokenizer=tokenizer_qwen,
                                                            data=val_data,
                                                            device=device)

Data completed in 3.01 minutes.

> Prediction for 200 samples takes 3.01 minutes with our fine-tuned LoRA model `Qwen-LoRA-Estate`.

### **Evaluation metrics for Fine-Tuned `Qwen3-0.6B`**

`evaluate_model` is implemented in the `utils.py` utility script.

In [15]:
print("Actual Price", true_labels[:10])
print("Predicted Prices", qwen_lora_estate_preds[:10])

Actual Price [2500000.0, 295000.0, 299900.0, 699000.0, 239000.0, 11000.0, 470000.0, 449000.0, 250000.0, 339000.0]
Predicted Prices [1100000, 269000, 315000, 725000, 125000, 140000, 411000, 499900, 319000, 449900]


**Evaluation Metrics and Predictions for our `Qwen-LoRA-Estate` Model**

In [None]:
qwen_lora_estate_metrics = evaluate_model(true_labels, qwen_lora_estate_preds)

qwen_lora_estate_metrics["n_samples"] = len(val_data)
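# The fine-tuned run took 3.01 minutes (see the timer output above)
qwen_lora_estate_metrics["eval_time (min)"] = 3.01
qwen_lora_estate_metrics["response_time (min)"] = 3.01 / len(val_data)
qwen_lora_estate_metrics["eval_device"] = "gpu-t4x2"
# Use the adapter name so this row is distinguishable from the base model
qwen_lora_estate_metrics["model_name"] = adaptor_id.split('/')[-1]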

JSON(qwen_lora_estate_metrics)

In [None]:
Qwen_lora_estate_results = pd.DataFrame(Qwen_lora_estate_tokens_history)
Qwen_lora_estate_results['y_actual'] = true_labels
Qwen_lora_estate_results['y_pred'] = qwen_lora_estate_preds

missed_prec = len(Qwen_lora_estate_results[Qwen_lora_estate_results['y_pred']==-1]) * 100 /len(val_data)
qwen_lora_estate_metrics["missing_pred(%)"] = missed_prec

print(f">>>>>> Fine-tuned Qwen can't predict {missed_prec}% from the given data <<<<<<")

Qwen_lora_estate_results 

**Save Results for our `Qwen-LoRA-Estate` Model**

In [None]:
with open(f"{os.environ['RESULTS']}/qwen_lora_estate_metrics.json", 'w') as json_file:
    json.dump(qwen_lora_estate_metrics, json_file, indent=4)

Qwen_lora_estate_results.to_csv(f"{os.environ['RESULTS']}/qwen_lora_estate_results.csv", index=False)

In [None]:
# Zip the results so they can be downloaded for further comparisons
!zip /kaggle/working/results_200.zip /kaggle/working/results/*

In [None]:
json_files = []

for root, dirs, files in os.walk(os.environ['RESULTS']):
    for file in files:
        if file.endswith('.json'):
            json_files.append(file)

print(json_files)

['gemini_metrics.json', 'pretrained_qwen_metrics.json', 'qwen_lora_estate_metrics.json']


In [None]:
results = []

for filepath in json_files:
    with open(f"{os.environ['RESULTS']}/{filepath}", 'r') as f:
        results.append(json.load(f))

df_results = pd.DataFrame(results)
df_results.set_index('model_name', inplace=True)

# Keep the index so the `model_name` column is written to the CSV
df_results.to_csv(f"{os.environ['RESULTS']}/results_comparisons.csv", index=True)
df_results
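
As a small worked comparison, the per-sample latency of each run follows directly from the timings reported in this notebook (2.45, 2.82, and 3.01 minutes for 200 samples). The sketch below converts them to seconds per sample; the dictionary labels are chosen here for readability.

```python
# Total evaluation time in minutes, as reported by the timer output
runs = {
    "gemini-2.0-flash-exp": 2.45,
    "Qwen3-0.6B (base)": 2.82,
    "Qwen-LoRA-Estate": 3.01,
}
N_SAMPLES = 200

seconds_per_sample = {name: minutes * 60 / N_SAMPLES for name, minutes in runs.items()}

for name, seconds in seconds_per_sample.items():
    print(f"{name}: {seconds:.3f} s/sample")
```

The three runs are within a fraction of a second of each other per sample, so the comparison between models rests mainly on prediction quality and on the share of missing predictions.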