# **Medical Named Entity Recognition (NER)**

This notebook compares accuracy against computational efficiency. I compare a "heavy" model (such as BERT) with a "light" model (such as scispaCy/CNN) to determine which is better suited for deployment in hospitals in developing countries such as Indonesia, where server capacity is limited.

## **Experiment 1**

**Dataset:** BC5CDR (BioCreative V CDR Task)

**Goal:** Build a model that automatically detects Disease and Chemical entities in medical text.


### **Stage 1: Data Preparation**
This step loads the raw data (JSON) and verifies its structure before model training.

**Importing the Main Libraries**

In [2]:
# --- Import the main libraries ---
import json
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Transformers library (Hugging Face)
from transformers import AutoTokenizer, AutoModelForTokenClassification

from torch.optim import AdamW
from tqdm.auto import tqdm  # progress bar

# Configure pandas so that table cells are not truncated
pd.set_option('display.max_colwidth', None)

# Check whether a GPU (NVIDIA) or the CPU is used
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"⚙️ Device in use: {device}")

⚙️ Device in use: cuda


**Function for Reading the JSON Files**

In [4]:
# --- Function for reading a JSON file ---
def load_json_file(file_path):
    data = []
    try:
        # Read the file as JSON Lines (one JSON object per line)
        with open(file_path, 'r', encoding='utf-8') as f:
            data = [json.loads(line) for line in f]
        print(f"✅ Loaded: {os.path.basename(file_path)} ({len(data)} sentences)")
        return data
    except Exception as e:
        print(f"❌ Failed to load {file_path}: {e}")
        return []

# Location of the dataset folder
dataset_path = 'bc5cdr'

**Loading the Train, Validation, and Test Sets**

In [5]:
# --- Load the training, validation, and test data ---

print("📂 Reading the dataset...")
train_data = load_json_file(os.path.join(dataset_path, 'train.json'))
valid_data = load_json_file(os.path.join(dataset_path, 'valid.json'))
test_data  = load_json_file(os.path.join(dataset_path, 'test.json'))

📂 Reading the dataset...
✅ Loaded: train.json (5228 sentences)
✅ Loaded: valid.json (5330 sentences)
✅ Loaded: test.json (5865 sentences)
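
The loader above assumes each file is in JSON Lines format, with one object per line carrying `tokens` and `tags` fields (the field names match the sample shown later in this notebook). As a minimal sketch of that format, the hypothetical helper `parse_json_lines` below parses such text from a string, skips blank lines, and reports the offending line number instead of discarding the whole file. It is an illustration, not the loader used in this experiment.

```python
import json
from typing import Any, Dict, List


def parse_json_lines(text: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
    return records


# Two hypothetical records separated by a blank line
sample = '{"tokens": ["Naloxone", "reverses"], "tags": [1, 0]}\n\n{"tokens": ["."], "tags": [0]}\n'
records = parse_json_lines(sample)
print(len(records), records[0]["tokens"])
```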


**Configuring the Label Mapping**

In [6]:
# --- Load the labels and build the mappings ---

try:
    with open(os.path.join(dataset_path, 'label.json'), 'r', encoding='utf-8') as f:
        label_map = json.load(f)  # Contains {'B-Chemical': 1, ...}

    # We also need the inverse: ID -> label name (for human-readable output)
    id2label = {v: k for k, v in label_map.items()}
    label2id = label_map

    print("\n✅ Label map loaded.")
    print("Categories:")
    for id_angka, nama in id2label.items():
        print(f"  {id_angka}: {nama}")

except FileNotFoundError:
    print("❌ label.json not found! Make sure the file is in the 'bc5cdr' folder.")


✅ Label map loaded.
Categories:
  0: O
  1: B-Chemical
  2: B-Disease
  3: I-Disease
  4: I-Chemical
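
The labels follow the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation token, and `O` a token outside any entity. As a minimal sketch of how such tags turn into entity spans, the hypothetical helper `bio_to_spans` below groups tags into `(type, start, end)` tuples. It assumes well-formed input and simply ignores an `I-` tag that has no preceding `B-` tag of the same type.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end_exclusive) spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it
        if current is not None and tag != f"I-{current}":
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    # Close a span that runs to the end of the sentence
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Tags of the sentence "Naloxone reverses the antihypertensive effect of clonidine ."
tags = ["B-Chemical", "O", "O", "O", "O", "O", "B-Chemical", "O"]
print(bio_to_spans(tags))
```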


**Inspecting Sample Training Data**

In [8]:
# Convert to a pandas DataFrame for easier inspection
df_train = pd.DataFrame(train_data)

# Translate a list of tag IDs into a list of label names
def angka_ke_label(tag_list):
    return [id2label[x] for x in tag_list]

# Take 5 samples to check
df_sample = df_train.head(5).copy()

# Add a column with label names (instead of IDs) so it is easy to read
df_sample['tags_readable'] = df_sample['tags'].apply(angka_ke_label)

# Show the tokens column (the words) and tags_readable (their labels)
display(df_sample[['tokens', 'tags_readable']])

index,tokens,tags_readable
0,"[Naloxone, reverses, the, antihypertensive, effect, of, clonidine, .]","[B-Chemical, O, O, O, O, O, B-Chemical, O]"
1,"[In, unanesthetized, ,, spontaneously, hypertensive, rats, the, decrease, in, blood, pressure, and, heart, rate, produced, by, intravenous, clonidine, ,, 5, to, 20, micrograms, /, kg, ,, was, inhibited, or, reversed, by, nalozone, ,, 0, .]","[O, O, O, O, B-Disease, O, O, O, O, O, O, O, O, O, O, O, O, B-Chemical, O, O, O, O, O, O, O, O, O, O, O, O, O, B-Chemical, O, O, O]"
2,"[2, to, 2, mg, /, kg, .]","[O, O, O, O, O, O, O]"
3,"[The, hypotensive, effect, of, 100, mg, /, kg, alpha-methyldopa, was, also, partially, reversed, by, naloxone, .]","[O, B-Disease, O, O, O, O, O, O, B-Chemical, O, O, O, O, O, B-Chemical, O]"
4,"[Naloxone, alone, did, not, affect, either, blood, pressure, or, heart, rate, .]","[B-Chemical, O, O, O, O, O, O, O, O, O, O, O]"


### **Stage 2: Tokenization and Label Alignment**
We use the **BERT tokenizer** (`bert-base-cased`).
The "cased" variant was chosen because it preserves upper and lower case, which matters when detecting drug and disease names (for example, "Vitamin D" vs "d").

The main challenge here is **sub-word tokenization**.
If the word `Hydroxychloroquine` (one word) is split into `['Hy', '##dro', '##xy', ...]` (several tokens), the label `B-Chemical` must be attached only to the first token (`Hy`), while the remaining pieces get the label `-100` so that they are ignored when the loss is computed.
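
The alignment rule can be sketched in plain Python, independent of any tokenizer. The hypothetical helper `align_labels` below takes the per-token word indices (the kind of list a fast tokenizer returns from `word_ids()`, with `None` for special tokens) and the word-level labels. The example split of the sentence is made up for illustration.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each word's label to its first sub-token and IGNORE_INDEX to the rest."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special tokens ([CLS], [SEP], padding) and continuation pieces
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical split: [CLS] Hy ##dro ##xy is toxic [SEP]
word_ids = [None, 0, 0, 0, 1, 2, None]
word_labels = [1, 0, 0]  # B-Chemical, O, O
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, -100, 0, 0, -100]
```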

**Loading the Tokenizer**

In [22]:
# --- Load the BioBERT tokenizer ---
# Note: the following cells replace this tokenizer with "bert-base-cased",
# which is the checkpoint used for the rest of this experiment.

# Use the same checkpoint for the tokenizer as for the model
MODEL_CHECKPOINT = "dmis-lab/biobert-v1.1"

print(f"⏳ Downloading the vocabulary of {MODEL_CHECKPOINT}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

print("✅ BioBERT tokenizer ready!")

⏳ Downloading the vocabulary of dmis-lab/biobert-v1.1...
✅ BioBERT tokenizer ready!


In [23]:
from transformers import AutoTokenizer

# We use "bert-base-cased",
# a widely used general-purpose baseline for NER tasks
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

print("✅ Tokenizer loaded!")

# Test on a single difficult word
contoh_kata = "Hydrochlorothiazide"
hasil_token = tokenizer.tokenize(contoh_kata)
print(f"Example split of the word '{contoh_kata}':")
print(hasil_token)

✅ Tokenizer loaded!
Example split of the word 'Hydrochlorothiazide':
['H', '##ydro', '##ch', '##lor', '##oth', '##ia', '##zi', '##de']


**Tokenization and Alignment Function**

In [24]:
# --- Load the tokenizer and define the alignment function ---
from transformers import AutoTokenizer

# Use the standard checkpoint, which is safe to load (supports SafeTensors)
MODEL_CHECKPOINT = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

# Split words into sub-tokens and align the labels with them
def tokenize_and_align_labels(dataset_list):
    all_tokens = [item["tokens"] for item in dataset_list]
    all_tags   = [item["tags"] for item in dataset_list]

    tokenized_inputs = tokenizer(
        all_tokens,
        truncation=True,
        is_split_into_words=True,
        max_length=128,
        padding="max_length"
    )
    labels = []
    for i, label_asli in enumerate(all_tags):
        # word_ids maps every sub-token to the index of its source word
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        prev_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == prev_idx:
                # Special tokens, padding, and continuation pieces are ignored
                label_ids.append(-100)
            else:
                try:
                    label_ids.append(label_asli[word_idx])
                except IndexError:
                    # Fewer tags than tokens in this sentence: ignore the extra word
                    label_ids.append(-100)
            prev_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

print("✅ Tokenizer ready to use.")

✅ Tokenizer ready to use.


**Processing the Data and Building the DataLoaders**

In [25]:
# --- Process the data and build the DataLoaders ---
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Process the data
print("⏳ Processing the data...")
tokenized_train = tokenize_and_align_labels(train_data)
tokenized_valid = tokenize_and_align_labels(valid_data)

# 2. Dataset wrapper
class NERDataset(Dataset):
    def __init__(self, encodings): self.encodings = encodings
    def __getitem__(self, i): return {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
    def __len__(self): return len(self.encodings["input_ids"])

# 3. Data loaders (deliver the data in batches)
train_loader = DataLoader(NERDataset(tokenized_train), batch_size=8, shuffle=True)
valid_loader = DataLoader(NERDataset(tokenized_valid), batch_size=8)

print("✅ Data ready for training!")

⏳ Processing the data...
✅ Data ready for training!


**Checking the Tokenization Result**

In [26]:
# Take the sample at index 0
index = 0
input_ids = tokenized_train["input_ids"][index]
labels = tokenized_train["labels"][index]

print("Original Tokens:", train_data[index]["tokens"])
print("-" * 30)

# Loop over the pairs of token and label ID
print(f"{'TOKEN':<15} {'LABEL ID':<10} {'LABEL NAME'}")
for id_token, id_label in zip(input_ids, labels):
    # Skip padding tokens (ID 0 for BERT) to keep the output short
    if id_token == tokenizer.pad_token_id: continue

    token_str = tokenizer.decode([id_token])

    # Translate the label ID into its name
    if id_label == -100:
        label_str = "IGNORE"
    else:
        label_str = id2label[id_label]

    print(f"{token_str:<15} {id_label:<10} {label_str}")

Original Tokens: ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.']
------------------------------
TOKEN           LABEL ID   LABEL NAME
[CLS]           -100       IGNORE
Na              1          B-Chemical
##lo            -100       IGNORE
##xon           -100       IGNORE
##e             -100       IGNORE
reverse         0          O
##s             -100       IGNORE
the             0          O
anti            0          O
##hy            -100       IGNORE
##pert          -100       IGNORE
##ens           -100       IGNORE
##ive           -100       IGNORE
effect          0          O
of              0          O
c               1          B-Chemical
##lon           -100       IGNORE
##id            -100       IGNORE
##ine           -100       IGNORE
.               0          O
[SEP]           -100       IGNORE


### **Stage 3: Model Training (Fine-Tuning)**
This stage feeds the tokenized data to PyTorch through a `Dataset` and a `DataLoader`, loads the BERT model, and runs the training loop.

Main components:
1.  **NERDataset:** A wrapper around the data.
2.  **DataLoader:** Delivers the data to the model in *batches* (small packets) so that memory is not exhausted.
3.  **Model:** `BertForTokenClassification`.
4.  **Optimizer:** `AdamW` (the algorithm that updates the model's weights).

**Loading the Model and Optimizer**

In [27]:
# --- Prepare the model and optimizer ---
from transformers import AutoModelForTokenClassification
from torch.optim import AdamW  # imported from torch to avoid an import error

# 1. Load the model
print(f"⏳ Loading model {MODEL_CHECKPOINT}...")
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id
)

# 2. Move to the GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# 3. Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

print("✅ Model ready for training!")

⏳ Loading model bert-base-cased...

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

✅ Model ready for training!


**Training Loop**

In [32]:
# --- ULTIMATE TRAINING (MAXIMIZED PERFORMANCE) ---
from transformers import get_linear_schedule_with_warmup
import torch
from torch.optim import AdamW
from tqdm.auto import tqdm

# MAXIMAL CONFIGURATION
EPOCHS = 10           # Raised sharply so the model fits the training data closely (risk: overfitting)
LEARNING_RATE = 2e-5  # Smaller learning rate = smaller, more careful weight updates
BATCH_SIZE = 8        # Kept at 8 to stay within memory (train_loader was already built with batch_size=8)

print(f"🔥 Starting 'HARDCORE' training mode ({EPOCHS} epochs)...")
print("   Strategy: learn slowly (low LR) but for a long time (many epochs)")

# Reset the model and optimizer (important: start again from the pretrained checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id
)
model.to(device)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.01)

# Scheduler: with num_warmup_steps=0 there is no warm-up phase;
# the learning rate decays linearly from LEARNING_RATE to 0 over total_steps
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

# Training loop
model.train()

for epoch in range(EPOCHS):
    total_loss = 0
    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")

    for batch in pbar:
        batch = {k: v.to(device) for k, v in batch.items()}

        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        # Gradient clipping (prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        total_loss += loss.item()
        pbar.set_postfix({'loss': loss.item()})

    avg_loss = total_loss / len(train_loader)
    print(f"✅ Finished epoch {epoch+1}. Average loss: {avg_loss:.4f}")

print("\n🎉 Maximal training finished! The model has now been 'squeezed' for everything it can learn.")

🔥 Starting 'HARDCORE' training mode (10 epochs)...
   Strategy: learn slowly (low LR) but for a long time (many epochs)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/10: 100%|██████████| 654/654 [01:18<00:00,  8.37it/s, loss=0.106]
✅ Finished epoch 1. Average loss: 0.1326
Epoch 2/10: 100%|██████████| 654/654 [01:24<00:00,  7.73it/s, loss=0.00552]
✅ Finished epoch 2. Average loss: 0.0462
Epoch 3/10: 100%|██████████| 654/654 [01:22<00:00,  7.97it/s, loss=0.00329]
✅ Finished epoch 3. Average loss: 0.0240
Epoch 4/10: 100%|██████████| 654/654 [01:23<00:00,  7.86it/s, loss=0.000982]
✅ Finished epoch 4. Average loss: 0.0114
Epoch 5/10: 100%|██████████| 654/654 [01:23<00:00,  7.84it/s, loss=0.000721]
✅ Finished epoch 5. Average loss: 0.0072
Epoch 6/10: 100%|██████████| 654/654 [01:23<00:00,  7.81it/s, loss=0.00118]
✅ Finished epoch 6. Average loss: 0.0043
Epoch 7/10: 100%|██████████| 654/654 [01:22<00:00,  7.93it/s, loss=0.000152]
✅ Finished epoch 7. Average loss: 0.0020
Epoch 8/10: 100%|██████████| 654/654 [01:21<00:00,  8.00it/s, loss=8.57e-5]
✅ Finished epoch 8. Average loss: 0.0016
Epoch 9/10: 100%|██████████| 654/654 [01:21<00:00,  7.98it/s, loss=2.72e-5]
✅ Finished epoch 9. Average loss: 0.0006
Epoch 10/10: 100%|██████████| 654/654 [01:19<00:00,  8.26it/s, loss=0.000107]
✅ Finished epoch 10. Average loss: 0.0005

🎉 Maximal training finished! The model has now been 'squeezed' for everything it can learn.
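
The scheduler in the training cell is created with `num_warmup_steps=0`, so the learning rate starts at its full value and decays linearly to zero. The plain-Python sketch below reproduces that multiplier rule to make the shape of the schedule concrete. The function name `linear_lr` is hypothetical, and the step count is taken from the progress bars above (654 batches per epoch, 10 epochs).

```python
def linear_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int = 0) -> float:
    """Learning rate at a given optimizer step: linear warm-up, then linear decay to 0."""
    if step < warmup_steps:
        # Warm-up phase: ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from base_lr down to 0 at total_steps
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


total_steps = 654 * 10  # batches per epoch * epochs
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, base_lr=2e-5, total_steps=total_steps))
```

With no warm-up, the rate is 2e-5 at step 0, half of that at the midpoint, and 0 at the last step.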





### **Stage 4: Evaluation and Trial Run (Inference)**


**Evaluating the Model on the Validation Data**

In [33]:
# --- Post-optimization evaluation (entity-level precision, recall, F1) ---
from seqeval.metrics import classification_report
import numpy as np

print("📊 Computing the final report (maximized)...")
model.eval()

pred_list, label_list = [], []

with torch.no_grad():
    for batch in valid_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs.logits, dim=-1)

        predictions = predictions.detach().cpu().numpy()
        labels = labels.detach().cpu().numpy()

        for i in range(len(labels)):
            # Keep only positions with a real label (skip the ignored -100 positions)
            temp_pred = [id2label[p] for p, l in zip(predictions[i], labels[i]) if l != -100]
            temp_label = [id2label[l] for l in labels[i] if l != -100]
            pred_list.append(temp_pred)
            label_list.append(temp_label)

print(classification_report(label_list, pred_list))

📊 Computing the final report (maximized)...
              precision    recall  f1-score   support

    Chemical       0.92      0.92      0.92      5325
     Disease       0.80      0.84      0.82      4223

   micro avg       0.86      0.88      0.87      9548
   macro avg       0.86      0.88      0.87      9548
weighted avg       0.87      0.88      0.87      9548
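
The scores in this report are entity-level: a predicted entity counts as correct only when both its type and its exact boundaries match the gold annotation. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not seqeval's implementation; the helper names are hypothetical, and an `I-` tag without a preceding `B-` tag is ignored here, which may differ from how seqeval handles malformed sequences.

```python
from typing import List, Set, Tuple


def spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (entity_type, start, end_exclusive) spans from BIO tags."""
    found, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if current is not None and tag != f"I-{current}":
            found.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return found


def entity_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = spans(g_tags), spans(p_tags)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The predicted Disease span is one token too short, so it does not count as a match
gold = [["B-Chemical", "O", "B-Disease", "I-Disease"]]
pred = [["B-Chemical", "O", "B-Disease", "O"]]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```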



**Manual Test with Custom Sentences**

In [35]:
# --- Manual test (clean output) ---
from transformers import pipeline

# Pipeline with an aggregation strategy.
# It is meant to merge "Met" + "##for" + "##min" -> "Metformin";
# as the Ibuprofen result below shows, "simple" can still leave a word split.
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # this is the key setting
    device=0 if torch.cuda.is_available() else -1
)

def tes_canggih(kalimat):
    print(f"\n📝 Input: {kalimat}")
    print("-" * 60)
    print(f"{'ENTITY (DRUG/DISEASE)':<30} | {'CATEGORY':<15} | {'CONFIDENCE'}")
    print("-" * 60)

    hasil = ner_pipeline(kalimat)
    for h in hasil:
        # Only show entities with a confidence above 50%
        if h['score'] > 0.5:
            print(f"💎 {h['word']:<30} | {h['entity_group']:<15} | {h['score']:.1%}")

# TRIAL RUN
tes_canggih("The patient was prescribed Aspirin and Metformin for his chronic heart failure.")
tes_canggih("Long-term use of Ibuprofen can lead to kidney damage.")

Device set to use cuda:0

📝 Input: The patient was prescribed Aspirin and Metformin for his chronic heart failure.
------------------------------------------------------------
ENTITY (DRUG/DISEASE)          | CATEGORY        | CONFIDENCE
------------------------------------------------------------
💎 Aspirin                        | Chemical        | 84.6%
💎 Metformin                      | Chemical        | 100.0%
💎 chronic heart failure          | Disease         | 95.2%

📝 Input: Long-term use of Ibuprofen can lead to kidney damage.
------------------------------------------------------------
ENTITY (DRUG/DISEASE)          | CATEGORY        | CONFIDENCE
------------------------------------------------------------
💎 I                              | Chemical        | 100.0%
💎 ##buprofen                     | Chemical        | 96.3%
💎 kidney damage                  | Disease         | 100.0%
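
In the second result, one drug name comes back as two groups ("I" and "##buprofen"). A likely cause, stated here as an assumption, is that only the first sub-token of each word carried a label during training, so the model's predictions on continuation pieces are unconstrained and can start a new group. The pipeline also accepts word-level strategies such as `aggregation_strategy="first"`, which was not tried in this notebook. As a minimal post-processing sketch, the hypothetical helper `merge_split_entities` below glues a `##` continuation onto the preceding group of the same category and keeps the lower of the two scores (a conservative choice made for this illustration).

```python
from typing import Any, Dict, List


def merge_split_entities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge '##' continuation groups into the preceding group of the same category."""
    merged: List[Dict[str, Any]] = []
    for ent in entities:
        is_continuation = ent["word"].startswith("##")
        if is_continuation and merged and merged[-1]["entity_group"] == ent["entity_group"]:
            previous = merged[-1]
            previous["word"] += ent["word"][2:]  # drop the '##' marker
            previous["score"] = min(previous["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input list is left untouched
    return merged


# Groups shaped like the pipeline output shown above
raw = [
    {"word": "I", "entity_group": "Chemical", "score": 1.0},
    {"word": "##buprofen", "entity_group": "Chemical", "score": 0.963},
    {"word": "kidney damage", "entity_group": "Disease", "score": 1.0},
]
for ent in merge_split_entities(raw):
    print(ent["word"], ent["entity_group"], ent["score"])
```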


**Saving the Model**

In [36]:
# --- SAVE THE FINAL MODEL ---
output_dir = "./model_medis_final_87persen"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ Model saved to folder: {output_dir}")
print("Ready to be compared with the lightweight model!")

✅ Model saved to folder: ./model_medis_final_87persen
Ready to be compared with the lightweight model!
