# News Text Classification with a Fine-Tuned BERT Model

This code classifies news text using a pre-trained BERT model. The process covers loading the dataset, preprocessing the data, tokenization, building a PyTorch dataset, and fine-tuning BERT to assign each news item to one of the predefined categories.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Load the dataset from Excel
file_path = 'data_ready_with_kategori.xlsx'
data = pd.read_excel(file_path)

# Keep only the text_berita and Kategori columns and drop rows with missing values
filtered_data = data[['text_berita', 'Kategori']].dropna()

# Map string categories to integers
filtered_data['Kategori'] = filtered_data['Kategori'].astype('category')
filtered_data['Kategori_encoded'] = filtered_data['Kategori'].cat.codes

# Debug: Print category mapping
category_mapping = dict(enumerate(filtered_data['Kategori'].cat.categories))
print("Category Mapping:", category_mapping)

# Split the dataset into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    filtered_data['text_berita'],
    filtered_data['Kategori_encoded'],
    random_state=42,
    test_size=0.2
)

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input texts
train_encodings = tokenizer(list(train_texts), padding=True, truncation=True, max_length=128, return_tensors="pt")
val_encodings = tokenizer(list(val_texts), padding=True, truncation=True, max_length=128, return_tensors="pt")

# Define PyTorch dataset class
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.values  # Convert labels to NumPy array

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  # Ensure label dtype is long
        return item

    def __len__(self):
        return len(self.labels)

# Create datasets
train_dataset = CustomDataset(train_encodings, train_labels)
val_dataset = CustomDataset(val_encodings, val_labels)

# Fine-tune a pre-trained BERT model
num_labels = len(filtered_data['Kategori_encoded'].unique())  # Number of unique categories
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir='./logs',
    evaluation_strategy="epoch",
    report_to="none"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

# Save the fine-tuned model
output_dir = "./fine_tuned_bert_text_berita_model"
model.save_pretrained(output_dir)

print(f"Fine-tuned model saved to {output_dir}")

# Print the category mapping again for reference (it is not written to disk here)
print("Category Mapping:", category_mapping)


Category Mapping: {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Ketenagakerjaan', 4: 'Teknologi'}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.8841 | 0.87697 |
| 2 | 0.8167 | 0.810115 |
| 3 | 0.7928 | 0.787212 |


Fine-tuned model saved to ./fine_tuned_bert_text_berita_model
Category Mapping: {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Ketenagakerjaan', 4: 'Teknologi'}


- Model performance: Training loss fell from 0.8841 to 0.7928 and validation loss from 0.8770 to 0.7872 over three epochs, so the model is learning. Both values are well below the ln 5 ≈ 1.61 cross-entropy of a uniform guess over the five categories. No accuracy metric is computed in this run, so any gain in accuracy is inferred from the loss rather than measured.
- Generalization: Validation loss decreases in step with training loss, which suggests the model is not overfitting, i.e. fitting the training data so closely that it fails to generalize to new data.
- Training stability: If the downward trend continues in later epochs, that would indicate stable training and room for further accuracy gains.
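
The points above rest on loss alone. One way to measure accuracy directly is a metrics function of the kind `Trainer` accepts through its `compute_metrics` argument. The sketch below is an assumption rather than part of the original run: the name `compute_metrics`, the choice of macro-F1 as a second metric, and the demo values are all illustrative. It uses plain Python so it works on nested lists as well as on the NumPy arrays `Trainer` passes in.

```python
def compute_metrics(eval_pred):
    """Return accuracy and macro-F1 from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
    labels = [int(y) for y in labels]

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

    # Macro-F1: unweighted mean of per-class F1 over the classes present in labels.
    f1_scores = []
    for c in sorted(set(labels)):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        # The denominator is nonzero because class c occurs in labels.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Made-up two-class example: three of the four predictions are correct.
demo_logits = [[2.0, 0.1], [0.2, 1.5], [3.0, 0.0], [0.1, 0.9]]
demo_labels = [0, 1, 1, 1]
print(compute_metrics((demo_logits, demo_labels)))  # accuracy is 0.75
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add both values to the per-epoch evaluation report next to the validation loss.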

In [None]:
# Save the tokenizer
tokenizer.save_pretrained(output_dir)
print(f"Tokenizer saved to {output_dir}")


Tokenizer saved to ./fine_tuned_bert_text_berita_model
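
The category mapping is only printed during training, so it is lost once the session ends. A minimal sketch of storing it next to the model and tokenizer follows. The file name `category_mapping.json` and both helper functions are hypothetical, and a temporary directory stands in for the model directory.

```python
import json
import os
import tempfile


def save_category_mapping(mapping, directory, filename="category_mapping.json"):
    """Write the {int: label} mapping as JSON and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        # JSON object keys must be strings, so convert them explicitly.
        json.dump({str(k): v for k, v in mapping.items()}, f, ensure_ascii=False, indent=2)
    return path


def load_category_mapping(path):
    """Read the mapping back and restore the integer keys."""
    with open(path, encoding="utf-8") as f:
        return {int(k): v for k, v in json.load(f).items()}


# Mapping as printed by the training cell.
category_mapping = {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Ketenagakerjaan', 4: 'Teknologi'}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = save_category_mapping(category_mapping, tmp)
    restored = load_category_mapping(mapping_path)

print("Round trip preserved the mapping:", restored == category_mapping)
```

In the notebook, the call would presumably be `save_category_mapping(category_mapping, output_dir)`, so the mapping is included when the directory is zipped.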


In [None]:
import shutil

# Path to the fine-tuned model directory
model_directory = "./fine_tuned_bert_text_berita_model"

# Path to save the zip file
output_zip_path = "fine_tuned_bert_text_berita_model.zip"

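# Zip the fine-tuned model directory; make_archive appends ".zip" to the base name,
# so the base name must match output_zip_path without its extension
shutil.make_archive("fine_tuned_bert_text_berita_model", 'zip', model_directory)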

print(f"Model successfully zipped to {output_zip_path}")


Model successfully zipped to fine_tuned_bert_text_berita_model.zip
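
`shutil.make_archive` returns the path of the archive it wrote, so reporting that return value keeps the printed name in line with the file on disk. The sketch below is illustrative only: the helper `zip_directory` is hypothetical, and a temporary directory with a dummy `config.json` stands in for the model directory.

```python
import os
import shutil
import tempfile


def zip_directory(source_dir, zip_path):
    """Zip source_dir into zip_path and return the path actually written."""
    base_name, ext = os.path.splitext(zip_path)
    if ext != ".zip":
        raise ValueError("zip_path must end with .zip")
    # make_archive adds the ".zip" suffix to base_name itself.
    return shutil.make_archive(base_name, "zip", source_dir)


# Demo on a temporary directory standing in for the model directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")
    written = zip_directory(model_dir, os.path.join(tmp, "model.zip"))
    archive_exists = os.path.exists(written)

print("Archive created:", archive_exists)
```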


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

output_dir = "./fine_tuned_bert_text_berita_model"

# Load fine-tuned model and tokenizer
tokenizer = BertTokenizer.from_pretrained(output_dir)
model = BertForSequenceClassification.from_pretrained(output_dir)

print("Fine-tuned model and tokenizer loaded successfully!")


Fine-tuned model and tokenizer loaded successfully!
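
With the model and tokenizer reloaded, a single news text can be classified. The sketch below is an assumption about how that could look, not part of the original notebook: `decode_logits` and `predict_category` are hypothetical helpers, and `id2label` copies the mapping printed during training. `torch` is imported inside `predict_category` so that the decoding step can be tried on its own.

```python
import math

# Mapping as printed by the training cell.
id2label = {0: 'Ekonomi', 1: 'Hukum', 2: 'Kesehatan', 3: 'Ketenagakerjaan', 4: 'Teknologi'}


def decode_logits(logits, id2label):
    """Turn one row of raw logits into (label, probability) via a softmax."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    best = max(range(len(exps)), key=exps.__getitem__)
    return id2label[best], exps[best] / sum(exps)


def predict_category(text, tokenizer, model, id2label, max_length=128):
    """Classify one text with a loaded tokenizer and model (both on CPU)."""
    import torch  # lazy import: only needed when a real model is used

    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return decode_logits(logits, id2label)


# Decoding step alone, on made-up logits whose largest value is at index 1.
print(decode_logits([0.1, 2.0, 0.3, -1.0, 0.0], id2label))
```

With the objects loaded in the cell above, a call such as `predict_category("some news text", tokenizer, model, id2label)` would return the predicted category and its softmax probability. The `max_length=128` default mirrors the value used for tokenization during training.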
