# Fine-Tuning Preparation
Based on the analysis, the confidence score is correlated with the number of labels predicted: as the number of predicted labels increases, the confidence score tends to increase as well. However, this only applies after prediction.
The good news is that the analysis also shows a few label types with a poor ratio of high-confidence to low-confidence scores, while some low-frequency labels have a good ratio. So, to raise the confidence score for labels with lower frequency, the training data needs to be augmented. Two things can be done for this augmentation:
- Get contextual texts that correspond to the poor-ratio labels.
- Synthesize training data for rare texts.
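
One way to pick the labels that need augmentation is to compute, per label, the share of predictions that clear a confidence threshold. The sketch below is only an illustration: the `threshold` and `min_ratio` values, the helper names, and the sample predictions are assumptions, and it expects each prediction to be a dict with `label` and `score` keys.

```python
from collections import defaultdict


def confidence_ratio_by_label(predictions, threshold=0.5):
    """Return {label: (high, low, ratio)} with ratio = high / (high + low)."""
    counts = defaultdict(lambda: [0, 0])  # label -> [high count, low count]
    for pred in predictions:
        bucket = 0 if pred["score"] >= threshold else 1
        counts[pred["label"]][bucket] += 1
    return {
        label: (high, low, high / (high + low))
        for label, (high, low) in counts.items()
    }


def labels_to_augment(predictions, threshold=0.5, min_ratio=0.5):
    """List labels whose high-confidence share falls below min_ratio."""
    stats = confidence_ratio_by_label(predictions, threshold)
    return sorted(label for label, (_, _, ratio) in stats.items() if ratio < min_ratio)


# Hypothetical predictions, for illustration only (not from the analysis).
sample_predictions = [
    {"label": "location", "score": 0.91},
    {"label": "location", "score": 0.84},
    {"label": "event", "score": 0.32},
    {"label": "event", "score": 0.41},
    {"label": "event", "score": 0.77},
]

print(labels_to_augment(sample_predictions))  # ['event']
```

Labels returned by `labels_to_augment` are candidates for collecting more contextual texts or for synthetic examples.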

------------
## Data Preparation

### Import Libraries

In [28]:
import pandas as pd
from gliner import GLiNER
import torch
import os
import json
from sklearn.model_selection import train_test_split
import gliner_finetune
from gliner_finetune.convert import convert
from gliner_finetune.train import train_model

### Configuration and Data Loading

In [29]:
# config
data_path = "training_data.json"  # Path to your training data
output_dir = "fine_tuned_model"   # Directory to save the fine-tuned model
batch_size = 4                    # Adjust based on your GPU memory
learning_rate = 2e-5              # Standard learning rate for fine-tuning
num_epochs = 10
project_name = "gliner_finetuning_project"

In [30]:
os.makedirs(output_dir, exist_ok=True)

In [31]:
# load data
with open(data_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

In [32]:
# Convert and split the data into training, validation, and testing datasets
convert(data, 
        project_path=project_name,
        train_split=0.8,
        eval_split=0.15,
        test_split=0.05,
        train_file='train.json',
        eval_file='eval.json',
        test_file='test.json',
        overwrite=True)



Data saved to gliner_finetuning_project\assets\test.json
Data saved to gliner_finetuning_project\assets\train.json
Data saved to gliner_finetuning_project\assets\eval.json
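
Before training, it can help to check how many converted records actually carry entity annotations, since a record with an empty `ner` list contributes no positive spans. The sketch below is a minimal check under assumptions: the helper name is hypothetical, the sample records are made up, and the `[start, end, label]` span layout is assumed rather than taken from the converted files.

```python
def annotation_coverage(records):
    """Count how many records have at least one entry in their 'ner' list."""
    total = len(records)
    empty = sum(1 for record in records if not record.get("ner"))
    return {"total": total, "annotated": total - empty, "empty": empty}


# Hypothetical records, for illustration only.
sample_records = [
    {"tokenized_text": ["Komuniti", "di", "George", "Town", "."], "ner": []},
    {"tokenized_text": ["Sambutan", "di", "Kota", "Bharu", "."],
     "ner": [[2, 3, "location"]]},  # assumed [start, end, label] layout
]

print(annotation_coverage(sample_records))
```

To apply it to a real split, load the JSON file with `json.load` and pass the resulting list to `annotation_coverage`.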




[{'tokenized_text': ['Piala',
   'Malaysia',
   'yang',
   'dianjurkan',
   'di',
   'Kuantan',
   'telah',
   'melakar',
   'sejarah',
   'kelapan',
   '.',
   'Kejohanan',
   'ke-6',
   'akan',
   'lebih',
   'meriah',
   '.'],
  'ner': []},
 {'tokenized_text': ['Komuniti',
   "Baha'i",
   'di',
   'George',
   'Town',
   'meraikan',
   'perayaan',
   'tradisi',
   'mereka',
   '.'],
  'ner': []},
 {'tokenized_text': ['Kumpulan',
   'Bajau',
   'mengadakan',
   'protes',
   'aman',
   'di',
   'pusat',
   'bandar',
   'Petaling',
   'Jaya',
   '.'],
  'ner': []},
 {'tokenized_text': ['Sambutan',
   'Konsert',
   'Amal',
   'Merdeka',
   'keempat',
   'di',
   'Kota',
   'Bharu',
   'menarik',
   'perhatian',
   'ramai',
   '.',
   'Ini',
   'adalah',
   'sambutan',
   'keenam',
   'mereka',
   '.'],
  'ner': []},
 {'tokenized_text': ['Perasmian',
   'ketujuhbelas',
   'Konvensyen',
   'Pendidikan',
   'Nasional',
   'di',
   'George',
   'Town',
   'berlangsung',
   'semalam',
   '.'

### Import Model

In [33]:
model = GLiNER.from_pretrained("urchade/gliner_multi")

Fetching 4 files: 100%|██████████| 4/4 [00:00<?, ?it/s]


### Training

In [34]:
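# output_dir is not passed here: train_model() rejects it (see the TypeError
# recorded below); the model is saved separately in the "Save Model" step.
# Note: convert() reported saving the splits under the project's "assets"
# folder, so adjust these paths if the files are not found.
train_model(
    model=model,
    train_data=os.path.join(project_name, "train.json"),
    eval_data=os.path.join(project_name, "eval.json"),
    project=project_name,
    batch_size=batch_size,
    lr=learning_rate,
    num_epochs=num_epochs
)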

TypeError: train_model() got an unexpected keyword argument 'output_dir'
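
An "unexpected keyword argument" error like the one above can be caught before the call by comparing the intended keyword arguments against the function's signature. The sketch below uses the standard `inspect` module; `split_kwargs` and `fake_train` are hypothetical names, and `fake_train` is only a stand-in whose parameters are an assumption, not the real `train_model` signature.

```python
import inspect


def split_kwargs(func, candidate_kwargs):
    """Split kwargs into (accepted, rejected) according to func's signature."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means the function accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidate_kwargs), {}
    accepted = {k: v for k, v in candidate_kwargs.items() if k in params}
    rejected = {k: v for k, v in candidate_kwargs.items() if k not in params}
    return accepted, rejected


# Stand-in function for illustration; its parameters are assumed.
def fake_train(model, train_data, eval_data, project):
    return "trained"


accepted, rejected = split_kwargs(
    fake_train,
    {"model": "m", "train_data": "train.json", "eval_data": "eval.json",
     "project": "p", "output_dir": "fine_tuned_model"},
)
print(sorted(rejected))  # ['output_dir']
print(fake_train(**accepted))  # trained
```

Running `split_kwargs(train_model, {...})` in the notebook would show which of the intended arguments the installed version actually accepts.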

### Save Model

In [None]:
model.save_pretrained(output_dir)
print(f"Training complete! Model saved to: {output_dir}")

Training complete! Saved to: ..\fine_tuned_model


### Evaluation

In [None]:
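# convert() reported saving the test split under the project's "assets" folder;
# adjust this path if your project layout differs.
test_path = os.path.join(project_name, "assets", "test.json")
if os.path.exists(test_path):
    print("\nEvaluating on test set...")
    # GLiNER.evaluate() expects the loaded examples, not a file path
    with open(test_path, 'r', encoding='utf-8') as f:
        test_data = json.load(f)
    # It returns a formatted precision/recall/F1 summary string and the F1 score
    summary, f1 = model.evaluate(test_data, batch_size=batch_size)
    print(f"Test F1: {f1:.4f}")
    print(summary)
else:
    print("\nNo test set found for evaluation")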


Evaluating on test set...


KeyError: 'ner'
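
A `KeyError: 'ner'` suggests that something handed to the evaluator does not expose a `ner` key. A quick structural check of the records can narrow this down before evaluation. The sketch below is a minimal validator under assumptions: the helper name and the sample records are hypothetical, and the required keys are taken from the record preview shown earlier.

```python
def find_invalid_records(records, required_keys=("tokenized_text", "ner")):
    """Return (index, problem) pairs for records missing required keys."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "not a dict"))
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            problems.append((index, "missing keys: " + ", ".join(missing)))
    return problems


# Hypothetical records, for illustration only.
sample_test_records = [
    {"tokenized_text": ["Kumpulan", "Bajau", "."], "ner": []},
    {"tokenized_text": ["Piala", "Malaysia", "."]},  # no 'ner' key
    "Piala Malaysia",                                # wrong type
]

for index, problem in find_invalid_records(sample_test_records):
    print(f"record {index}: {problem}")
```

An empty result means every record has the expected keys, in which case the error more likely comes from how the data is passed to the evaluation call.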