# PhoBERT NER - Training Model

This notebook fine-tunes PhoBERT for Named Entity Recognition (NER) on Vietnamese medical data.

## Model
- Pretrained: `vinai/phobert-base` (135M parameters)
- Task: Token Classification (NER)
- Labels: B-SYMPTOM, I-SYMPTOM, B-DISEASE, I-DISEASE, O

In [None]:
import json
import numpy as np
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
import warnings
warnings.filterwarnings('ignore')

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️  Running on CPU (will be slower)")

✓ PyTorch version: 2.9.1+cpu
✓ CUDA available: False
⚠️  Running on CPU (will be slower)


## 1. Load the Data

In [None]:
# Read the training data
with open("../../data/processed/train_phobert.json", "r", encoding="utf-8") as f:
    train_data = json.load(f)

# Read the validation data
with open("../../data/processed/val_phobert.json", "r", encoding="utf-8") as f:
    val_data = json.load(f)

print(f"✓ Training: {len(train_data)} sentences")
print(f"✓ Validation: {len(val_data)} sentences")

print("\nExample:")
print(train_data[0])

✓ Training: 320 sentences
✓ Validation: 80 sentences

Example:
{'tokens': ['Viêm', 'tai', 'giữa'], 'ner_tags': ['B-DISEASE', 'I-DISEASE', 'I-DISEASE']}
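
Each record pairs a `tokens` list with an `ner_tags` list of the same length, so a quick structural check before training can catch malformed records early. The sketch below is one possible check and is not part of the original notebook: `find_malformed_records` is a hypothetical helper name, and the sample records are made up for illustration.

```python
def find_malformed_records(records):
    """Return indices of records that are empty or whose tokens and tags differ in length."""
    bad = []
    for i, rec in enumerate(records):
        tokens = rec.get("tokens", [])
        tags = rec.get("ner_tags", [])
        if not tokens or len(tokens) != len(tags):
            bad.append(i)
    return bad


# Illustrative records only (not taken from the real dataset)
sample_records = [
    {"tokens": ["Viêm", "tai", "giữa"], "ner_tags": ["B-DISEASE", "I-DISEASE", "I-DISEASE"]},
    {"tokens": ["sốt", "cao"], "ner_tags": ["B-SYMPTOM"]},  # one tag is missing
]
malformed = find_malformed_records(sample_records)
print(malformed)
```

Running the same function on `train_data` and `val_data` should return an empty list if both files are well formed.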


## 2. Create the Label Mapping

In [None]:
# Automatically collect all labels from the data
all_labels = set()
for item in train_data + val_data:
    all_labels.update(item['ner_tags'])

# Sort the labels (O always comes first)
label_list = sorted(all_labels)
if 'O' in label_list:
    label_list.remove('O')
    label_list = ['O'] + label_list

# Build the mappings
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

print(f"✓ Number of labels: {len(label_list)}")
print(f"✓ Labels: {label_list}")
print("\nLabel mapping:")
for label, idx in label2id.items():
    print(f"  {label} → {idx}")

✓ Number of labels: 5
✓ Labels: ['O', 'B-DISEASE', 'B-SYMPTOM', 'I-DISEASE', 'I-SYMPTOM']

Label mapping:
  O → 0
  B-DISEASE → 1
  B-SYMPTOM → 2
  I-DISEASE → 3
  I-SYMPTOM → 4
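
Since the labels follow the BIO scheme, an `I-X` tag is only well formed when it continues a `B-X` or `I-X` tag of the same type. The following is a minimal sketch of such a consistency check, assuming string tags as in the JSON files; `bio_violations` is a hypothetical helper, not something the notebook defines.

```python
def bio_violations(tags):
    """Return positions where an I- tag does not continue an entity of the same type."""
    violations = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # A valid I-X must directly follow B-X or I-X.
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                violations.append(i)
        prev = tag
    return violations


well_formed = bio_violations(["B-DISEASE", "I-DISEASE", "I-DISEASE"])
broken = bio_violations(["O", "I-SYMPTOM", "B-DISEASE", "I-SYMPTOM"])
print(well_formed, broken)
```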


## 3. Convert to a Hugging Face Dataset

In [None]:
def convert_tags_to_ids(data, label2id):
    """Convert text labels to integer IDs."""
    converted = []
    for item in data:
        tag_ids = [label2id[tag] for tag in item['ner_tags']]
        converted.append({
            'tokens': item['tokens'],
            'ner_tags': tag_ids
        })
    return converted

train_data_converted = convert_tags_to_ids(train_data, label2id)
val_data_converted = convert_tags_to_ids(val_data, label2id)

# Create the Dataset objects
train_dataset = Dataset.from_dict({
    'tokens': [item['tokens'] for item in train_data_converted],
    'ner_tags': [item['ner_tags'] for item in train_data_converted]
})

val_dataset = Dataset.from_dict({
    'tokens': [item['tokens'] for item in val_data_converted],
    'ner_tags': [item['ner_tags'] for item in val_data_converted]
})

print("✓ Created Hugging Face Datasets")
print(f"  Train: {len(train_dataset)} samples")
print(f"  Val: {len(val_dataset)} samples")

✓ Created Hugging Face Datasets
  Train: 320 samples
  Val: 80 samples


## 4. Load PhoBERT Tokenizer & Model

In [None]:
model_checkpoint = "vinai/phobert-base"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False)
print("✓ Tokenizer loaded")

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)
print("✓ PhoBERT model loaded")
print(f"  Parameters: {model.num_parameters():,}")

✓ Tokenizer loaded


Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at vinai/phobert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ PhoBERT model loaded
  Parameters: 134,411,525


## 5. Tokenize and Align Labels

In [None]:
def tokenize_and_align_labels(examples):
    """
    Tokenize the text and align the labels.
    The PhoBERT tokenizer (loaded with use_fast=False) does not support word_ids(),
    so the alignment is done manually.
    """
    max_length = 256  # Maximum sequence length, including the two special tokens
    tokenized_inputs = {
        "input_ids": [],
        "attention_mask": [],
        "labels": []
    }
    
    for tokens, labels in zip(examples["tokens"], examples["ner_tags"]):
        # Tokenize each word separately
        tokenized_tokens = []
        aligned_labels = []
        
        for token, label in zip(tokens, labels):
            # Tokenize the current word into subword IDs
            token_ids = tokenizer.encode(token, add_special_tokens=False)
            if not token_ids:
                # Skip words that yield no subwords so the labels stay aligned
                continue
            tokenized_tokens.extend(token_ids)
            
            # Real label on the first subword, -100 on the remaining subwords
            aligned_labels.append(label)
            aligned_labels.extend([-100] * (len(token_ids) - 1))
        
        # Truncate if too long, leaving room for the special tokens
        # so that the closing separator is never cut off
        tokenized_tokens = tokenized_tokens[:max_length - 2]
        aligned_labels = aligned_labels[:max_length - 2]
        
        # Add the special tokens (<s> and </s> play the [CLS]/[SEP] roles in PhoBERT)
        input_ids = [tokenizer.cls_token_id] + tokenized_tokens + [tokenizer.sep_token_id]
        attention_mask = [1] * len(input_ids)
        labels_with_special = [-100] + aligned_labels + [-100]
        
        tokenized_inputs["input_ids"].append(input_ids)
        tokenized_inputs["attention_mask"].append(attention_mask)
        tokenized_inputs["labels"].append(labels_with_special)
    
    return tokenized_inputs

# Tokenize datasets
print("🔄 Tokenizing data...")
tokenized_train = train_dataset.map(tokenize_and_align_labels, batched=True, remove_columns=train_dataset.column_names)
tokenized_val = val_dataset.map(tokenize_and_align_labels, batched=True, remove_columns=val_dataset.column_names)
print("✓ Tokenization complete")

# Inspect one sample
print("\n🔍 Tokenization check:")
sample = tokenized_train[0]
print(f"  Input IDs length: {len(sample['input_ids'])}")
print(f"  Labels length: {len(sample['labels'])}")
print(f"  Attention mask length: {len(sample['attention_mask'])}")

🔄 Tokenizing data...


Map: 100%|██████████| 320/320 [00:00<00:00, 1006.19 examples/s]
Map: 100%|██████████| 80/80 [00:00<00:00, 1594.55 examples/s]

✓ Tokenization complete

🔍 Tokenization check:
  Input IDs length: 5
  Labels length: 5
  Attention mask length: 5
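
To make the alignment rule concrete, here is a small self-contained sketch that does not need the real tokenizer. `toy_subword_split` is a hypothetical stand-in that cuts words into three-character pieces (PhoBERT's actual BPE segmentation is different), and the words and label IDs are made up; only the labeling logic mirrors the cell above: the first piece keeps the word's label, and every other piece and both special tokens get `-100`.

```python
def toy_subword_split(word, max_piece=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size character chunks."""
    return [word[i:i + max_piece] for i in range(0, len(word), max_piece)]


def align_labels(words, label_ids, split=toy_subword_split, ignore_index=-100):
    """Spread word-level labels over subword pieces, ignoring non-initial pieces."""
    pieces, aligned = [], []
    for word, label_id in zip(words, label_ids):
        sub = split(word)
        if not sub:
            continue  # nothing to label for an empty word
        pieces.extend(sub)
        aligned.append(label_id)                          # first piece keeps the label
        aligned.extend([ignore_index] * (len(sub) - 1))   # the rest are ignored by the loss
    # Special tokens at both ends are ignored as well
    return ["<s>"] + pieces + ["</s>"], [ignore_index] + aligned + [ignore_index]


pieces, aligned = align_labels(["headache", "and", "flu"], [2, 0, 1])
for piece, label in zip(pieces, aligned):
    print(f"{piece:<6} {label}")
```

Positions labeled `-100` are skipped by PyTorch's cross-entropy loss by default, which is why only the first piece of each word contributes to training.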





## 6. Data Collator

In [None]:
# The data collator pads each batch automatically
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
print("✓ Data collator created")

✓ Data collator created
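
As a rough illustration of what the collator does with a batch of unequal-length samples, the sketch below pads every field to the longest sequence in plain Python. It is a simplified approximation, not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and `pad_token_id=1` is an assumed value. Labels are padded with `-100`, which matches the collator's default `label_pad_token_id`.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad input_ids, attention_mask, and labels to the longest sample in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)  # 0 = ignore
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)       # no loss on padding
    return batch


# Made-up token IDs for illustration
features = [
    {"input_ids": [0, 11, 12, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, 1, 3, -100]},
    {"input_ids": [0, 21, 2], "attention_mask": [1, 1, 1], "labels": [-100, 2, -100]},
]
batch = pad_batch(features, pad_token_id=1)  # assumed pad ID
print(batch)
```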


## 7. Metrics Computation

In [None]:
def compute_metrics(eval_preds):
    """
    Compute precision, recall, and F1 score for the NER task
    using seqeval's entity-level metrics.
    """
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Convert predictions and labels to text labels
    true_labels = []
    true_predictions = []

    for prediction, label in zip(predictions, labels):
        true_label = []
        true_pred = []
        for pred, lab in zip(prediction, label):
            if lab != -100:  # Skip special tokens, padding, and non-initial subwords
                true_label.append(id2label[lab])
                true_pred.append(id2label[pred])
        true_labels.append(true_label)
        true_predictions.append(true_pred)

    # Compute the metrics
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }

print("✓ Metrics function defined")

✓ Metrics function defined
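
The scores above are entity-level: a predicted entity counts as correct only if its type and both boundaries match the gold entity. The sketch below illustrates that idea in plain Python. It is a simplified approximation and not seqeval's implementation (results may differ on malformed tag sequences); `extract_entities` and `entity_prf` are hypothetical helper names, and the example sequences are made up.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO sequence; end is exclusive."""
    entities = []
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes any open entity
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            entities.append((current, start, i))
            start, current = None, None
        # A B- tag always opens an entity; a stray I- tag is treated as an opening too
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exactly matching entity spans."""
    true_set, pred_set = set(), set()
    for n, (t, p) in enumerate(zip(true_seqs, pred_seqs)):
        true_set.update((n,) + e for e in extract_entities(t))
        pred_set.update((n,) + e for e in extract_entities(p))
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-SYMPTOM", "I-SYMPTOM", "O", "B-DISEASE"]]
y_pred = [["B-SYMPTOM", "O", "O", "B-DISEASE"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, f1)
```

In this example three of the four tokens are tagged correctly, yet only one of the two entities matches exactly, so entity-level scores are stricter than token accuracy.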


## 8. Training Arguments

In [None]:
training_args = TrainingArguments(
    output_dir="../../models/phobert_ner_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="../../models/logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
    save_total_limit=2  # Keep at most 2 checkpoints on disk
)

print("✓ Training configuration:")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Weight decay: {training_args.weight_decay}")

✓ Training configuration:
  Learning rate: 2e-05
  Batch size: 8
  Epochs: 5
  Weight decay: 0.01
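
These settings imply a fairly short run. The arithmetic below is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and a linear learning-rate decay without warmup; `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

num_samples = 320   # training sentences
batch_size = 8      # per_device_train_batch_size
epochs = 5          # num_train_epochs
base_lr = 2e-5      # learning_rate

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"LR halfway through: {linear_lr(total_steps // 2, total_steps, base_lr):.2e}")
```

With `logging_steps=10`, this works out to roughly four logged loss values per epoch.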


## 9. Initialize Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("✓ Trainer initialized")

✓ Trainer initialized


## 10. Train the Model 🚀

In [None]:
print("\n🚀 Starting training...\n")
print("=" * 60)

train_result = trainer.train()

print("\n" + "=" * 60)
print("✅ Training complete!")
print("\nTraining metrics:")
print(f"  Train loss: {train_result.training_loss:.4f}")
print(f"  Train runtime: {train_result.metrics['train_runtime']:.2f}s")


🚀 Starting training...



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 1.1177 | 0.973632 | 0.574713 | 0.60241 | 0.588235 |


KeyboardInterrupt: 

## 11. Evaluate Model

In [None]:
print("\n📊 Evaluating the model on the validation set...\n")

eval_results = trainer.evaluate()

print("\n" + "=" * 60)
print("📈 Validation Results:")
print("=" * 60)
print(f"Precision: {eval_results['eval_precision']:.4f}")
print(f"Recall:    {eval_results['eval_recall']:.4f}")
print(f"F1 Score:  {eval_results['eval_f1']:.4f}")
print(f"Loss:      {eval_results['eval_loss']:.4f}")
print("=" * 60)


📊 Evaluating the model on the validation set...




📈 Validation Results:
Precision: 1.0000
Recall:    1.0000
F1 Score:  1.0000
Loss:      0.0194
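
A perfect score on 80 validation sentences is worth a sanity check before it is reported. One possible check, sketched below under the assumption that records look like the `tokens`/`ner_tags` dictionaries loaded earlier, counts validation sentences that also appear verbatim in the training set. `count_overlap` is a hypothetical helper and the records are made up for illustration.

```python
def count_overlap(train_records, val_records):
    """Count validation sentences whose token sequence also occurs in the training set."""
    train_sentences = {tuple(r["tokens"]) for r in train_records}
    return sum(1 for r in val_records if tuple(r["tokens"]) in train_sentences)


# Illustrative records only
toy_train = [{"tokens": ["sốt", "cao"]}, {"tokens": ["ho", "khan"]}]
toy_val = [{"tokens": ["ho", "khan"]}, {"tokens": ["đau", "đầu"]}]
overlap = count_overlap(toy_train, toy_val)
print(f"{overlap} of {len(toy_val)} validation sentences also appear in training")
```

A high overlap on the real data would suggest that the validation score overstates how well the model generalizes.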


## 12. Save the Model

In [None]:
# Save the model and tokenizer
output_dir = "../../models/phobert_ner_model"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✓ Model saved to: {output_dir}")

# Save the label mapping (json is already imported at the top of the notebook).
# Note: JSON stores the integer keys of id2label as strings.
with open(f"{output_dir}/label_mapping.json", "w", encoding="utf-8") as f:
    json.dump({"label2id": label2id, "id2label": id2label}, f, ensure_ascii=False, indent=2)

print("✓ Label mapping saved")

✓ Model saved to: ../../models/phobert_ner_model
✓ Label mapping saved
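
One detail to keep in mind when reading `label_mapping.json` back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. The sketch below demonstrates the round trip in memory (no file is written) and converts the keys back to integers; the label values mirror the mapping printed in section 2.

```python
import json

label2id = {"O": 0, "B-DISEASE": 1, "B-SYMPTOM": 2, "I-DISEASE": 3, "I-SYMPTOM": 4}
id2label = {i: label for label, i in label2id.items()}

# Serialize and parse again, as saving and reloading the file would do
serialized = json.dumps({"label2id": label2id, "id2label": id2label}, ensure_ascii=False)
loaded = json.loads(serialized)

# Keys of id2label are now strings and must be converted back to int
restored_id2label = {int(i): label for i, label in loaded["id2label"].items()}

print(list(loaded["id2label"].keys()))
print(restored_id2label == id2label)
```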


## 13. Test the Model on Sample Sentences

In [None]:
from transformers import pipeline

# Create the NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B- and I- tags into a single entity
)

# Test cases
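test_texts = [
    "Tôi bị sốt cao và ho khan",  # "I have a high fever and a dry cough"
    "Em bé có triệu chứng đau bụng và buồn nôn",  # "The baby has symptoms of stomach pain and nausea"
    "Cảm cúm gây ra các triệu chứng như sốt, ho và mệt mỏi",  # "Flu causes symptoms such as fever, cough, and fatigue"
    "Tôi bị đau đầu dữ dội và chóng mặt",  # "I have a severe headache and dizziness"
    "Bệnh nhân có triệu chứng sổ mũi và nghẹt mũi"  # "The patient has symptoms of a runny nose and nasal congestion"
]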

print("\n" + "=" * 60)
print("🧪 TESTING THE MODEL ON SAMPLE SENTENCES")
print("=" * 60)

for i, text in enumerate(test_texts, 1):
    print(f"\n[{i}] Input: {text}")
    
    entities = ner_pipeline(text)
    
    if entities:
        print("    Entities found:")
        for ent in entities:
            print(f"      • {ent['word']:<20} [{ent['entity_group']}] (score: {ent['score']:.3f})")
    else:
        print("    (No entities found)")

print("\n" + "=" * 60)
print("\n✅ DONE!")
print("\n📝 Next step: run the notebook 03_compare_spacy_phobert.ipynb to compare with spaCy")

Device set to use cpu



🧪 TESTING THE MODEL ON SAMPLE SENTENCES

[1] Input: Tôi bị sốt cao và ho khan
    Entities found:
      • sốt cao              [SYMPTOM] (score: 0.953)
      • ho khan              [SYMPTOM] (score: 0.958)

[2] Input: Em bé có triệu chứng đau bụng và buồn nôn
    Entities found:
      • đau bụng             [SYMPTOM] (score: 0.972)
      • buồn nôn             [SYMPTOM] (score: 0.972)

[3] Input: Cảm cúm gây ra các triệu chứng như sốt, ho và mệt mỏi
    Entities found:
      • Cảm cúm              [DISEASE] (score: 0.909)
      • số@@                 [SYMPTOM] (score: 0.877)
      • ho                   [SYMPTOM] (score: 0.918)
      • mệt mỏi              [SYMPTOM] (score: 0.965)

[4] Input: Tôi bị đau đầu dữ dội và chóng mặt
    Entities found:
      • đau đầu              [SYMPTOM] (score: 0.950)
      • dữ dội               [SYMPTOM] (score: 0.843)
      ‚Ä