Automatic labeling with IndoBERT means assigning labels to text data automatically using IndoBERT, a pretrained language model for Indonesian.

IndoBERT is an Indonesian model built on the BERT (Bidirectional Encoder Representations from Transformers) architecture, a deep-learning language model pretrained on large amounts of text for language understanding. The checkpoint used here is pretrained only: it has to be fine-tuned on labeled data before it can be used for a downstream task such as sentiment classification or question answering.

IndoBERT model card (base variant; the code below loads indobenchmark/indobert-large-p1): https://huggingface.co/indobenchmark/indobert-base-p1

Paper:
https://aclanthology.org/2020.aacl-main.85/

In [None]:
# Install the Transformers package
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
# Import libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import pandas as pd

In [None]:
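# Load the IndoBERT tokenizer and the model with a sequence-classification head.
# Note: this checkpoint is not fine-tuned for sentiment. The classification head is
# randomly initialized (see the warning below) and needs training on labeled data
# before its predictions are meaningful.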
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p1")
model = AutoModelForSequenceClassification.from_pretrained("indobenchmark/indobert-large-p1")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indobenchmark/indobert-large-p1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Load the data
data = pd.read_excel('scrapped_aplikasi_dana.xlsx')
# Take the 'content' column (the review text) from the DataFrame
reviews = data['content']
reviews

0     Tolong dana, lebih maju lagi aplikasinya. Tamb...
1     Untuk aplikasi bagus,fiturnya lengkap,cuma yan...
2     SANGAT PENTING SEKALI, ketika dibutuhkan hampi...
3     Jijik sama chat bot nya, tidak ada layanan cs ...
4     Sudah di update beberapa kali tetapi tetap per...
                            ...                        
95    Aplikasi dana Makin kesini bukan makin baik ma...
96    Tidak user friendly, pada menu pusat resolusi,...
97    Sering error setelah melakukan top up, direfre...
98    Mohon maaf ini min, sering banget terjadi dima...
99    gak bisa login aplikasi, keterangan yang muncu...
Name: content, Length: 100, dtype: object

In [None]:
import torch

# Lists for the predicted sentiment and the class probabilities
sentiments = []
positive_probabilities = []
negative_probabilities = []

# Loop over every review in the 'content' column
for review in reviews:
    # Tokenize the text, padding or truncating to 128 tokens
    tokens = tokenizer(review, max_length=128, padding="max_length", truncation=True, return_tensors="pt")

    # Run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**tokens)
    probabilities = outputs.logits.softmax(dim=1)
    # Index 1 is treated as "Positive" and index 0 as "Negative". This mapping is an
    # assumption: the head is untrained, and the two probabilities in the output below
    # do not sum to 1, which suggests the head has more than two classes.
    positive_probability = probabilities[0][1].item()
    negative_probability = probabilities[0][0].item()

    # Pick the sentiment with the higher of the two probabilities
    sentiment = "Positive" if positive_probability > negative_probability else "Negative"

    # Store the sentiment and the probabilities
    sentiments.append(sentiment)
    positive_probabilities.append(positive_probability)
    negative_probabilities.append(negative_probability)
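
The loop above compares only the entries at index 0 and index 1 of the softmax output. If the classification head has more than two classes, the largest probability may sit at another index. The following is a minimal pure-Python sketch of that effect. The logits, the three-class setup, and the `id2label` mapping are all made up for illustration and are not taken from the IndoBERT checkpoint.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label):
    """Return the label with the highest probability across ALL classes."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs

# Hypothetical logits for a 3-class head (values are invented)
example_logits = [0.1, 1.2, 1.5]
# Hypothetical mapping; a real one would come from the fine-tuned model's config
id2label = {0: "Negative", 1: "Positive", 2: "Neutral"}

label, probs = predict_label(example_logits, id2label)
print(label, [round(p, 3) for p in probs])
```

Here index 1 beats index 0, so a two-way comparison would report "Positive", while the argmax over all classes picks index 2. Checking that the probabilities being compared sum to 1 is a quick way to notice this situation.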

In [None]:
# Add the sentiment and probability columns to the DataFrame
data['Sentiment'] = sentiments
data['Positive Probability'] = positive_probabilities
data['Negative Probability'] = negative_probabilities

# Display the DataFrame with the sentiment results
data


Unnamed: 0,userName,score,at,content,Sentiment,Positive Probability,Negative Probability
0,Ayurhmwti,1,2023-06-24 06:50:33,"Tolong dana, lebih maju lagi aplikasinya. Tamb...",Positive,0.448288,0.072378
1,Pelangi Aisyah Insani,3,2023-06-24 04:04:25,"Untuk aplikasi bagus,fiturnya lengkap,cuma yan...",Positive,0.479257,0.078796
2,Ahmad Fajri,1,2023-06-23 20:46:43,"SANGAT PENTING SEKALI, ketika dibutuhkan hampi...",Positive,0.374623,0.081280
3,Jo san,1,2023-06-23 18:12:05,"Jijik sama chat bot nya, tidak ada layanan cs ...",Positive,0.332947,0.115360
4,Rangga Ramadhan,2,2023-06-23 13:09:24,Sudah di update beberapa kali tetapi tetap per...,Negative,0.102553,0.112455
...,...,...,...,...,...,...,...
95,K- Flash,2,2023-04-17 14:10:21,Aplikasi dana Makin kesini bukan makin baik ma...,Negative,0.144789,0.214765
96,Ardi Ant,2,2023-04-15 10:57:47,"Tidak user friendly, pada menu pusat resolusi,...",Positive,0.332458,0.104760
97,Riyadh Achmad,2,2023-04-11 12:56:08,"Sering error setelah melakukan top up, direfre...",Positive,0.327594,0.090234
98,M Rizal Nasrullah,1,2023-04-10 14:41:01,"Mohon maaf ini min, sering banget terjadi dima...",Positive,0.205092,0.117069
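
Several one-star and two-star reviews in the table above are labeled Positive, which fits the earlier warning that the classification head is untrained. One rough sanity check is to compare the predicted labels against the star ratings in the `score` column. The sketch below assumes that a score of 2 or lower means negative and 4 or higher means positive, with 3 skipped as ambiguous. These thresholds and the helper names are assumptions for illustration, not part of the original notebook. The sample values are the nine rows visible in the table.

```python
def rating_to_weak_label(score):
    """Map a star rating to a weak sentiment label (thresholds are an assumption)."""
    if score <= 2:
        return "Negative"
    if score >= 4:
        return "Positive"
    return None  # a 3-star review is treated as ambiguous and skipped

def agreement_with_ratings(scores, sentiments):
    """Fraction of rated reviews whose predicted label matches the weak label."""
    pairs = [(rating_to_weak_label(s), p) for s, p in zip(scores, sentiments)]
    pairs = [(weak, pred) for weak, pred in pairs if weak is not None]
    if not pairs:
        return None
    return sum(weak == pred for weak, pred in pairs) / len(pairs)

# The nine rows visible in the output table above (rows 0-4 and 95-98)
scores = [1, 3, 1, 1, 2, 2, 2, 2, 1]
sentiments = ["Positive", "Positive", "Positive", "Positive", "Negative",
              "Negative", "Positive", "Positive", "Positive"]

rate = agreement_with_ratings(scores, sentiments)
print(f"Agreement with star ratings: {rate:.2f}")
```

On the full DataFrame the same function could be called as `agreement_with_ratings(data['score'].tolist(), data['Sentiment'].tolist())`. A low agreement rate is a signal to fine-tune the model on labeled reviews before trusting the automatic labels.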


In [None]:
# Export Data
data.to_excel('hasil_labeling_review_after_covid_2.xlsx', index=False)