## Importing Libraries

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from transformers import pipeline

import pandas as pd
import numpy as np
import re
import json
import emoji
import hf_xet
import os

In [3]:
data = pd.read_csv("data.csv")

# Exploratory Data Analysis
Some EDA processes are done to understand the context of the data better and understand what preprocessing methods are suitable.

In [4]:
data.head()

Unnamed: 0,index,tweet_id,username,tweet_content,tweet_label
0,0,1970083558544089111,Project Multatuli,Apa yang menurutmu salah dari MBG? https://t.c...,
1,1,1967802981862277607,salam4jari,Surat Edaran Proyek Makan Beracun Gratis (MBG)...,
2,2,1968473097616638158,Beby Sweet,Korban keracunan MBG masih terus berjatuhan.\n...,
3,3,1969929640853844095,zhil,MBG adalah bukti paripurna jeleknya kualitas d...,
4,4,1969023678483693785,tempo.co,Menteri Keuangan Purbaya Yudhi Sadewa bakal me...,


In [5]:
data.tail()

Unnamed: 0,index,tweet_id,username,tweet_content,tweet_label
4833,3438,1876842649832903108,nine_tyseven,berkat e tonggoku sing mari meninggal ae luwih...,
4834,3439,1876841006307443159,PaltiWest2024,Menunggu hasil foto dan laporan terkait Progra...,
4835,3440,1876838346657345671,ilmi_roudhotul,Anakmu atau kluarga Gibran dan lainya juga mak...,
4836,3441,1876836098661323049,SamPrd24,@FOODFESS2 @gibran_tweet bisa ga menu makan si...,
4837,3442,1876834866538430595,Iam_Nobody_1145,@kurawa Tanya'in suuuu.. AnakÂ²e @gibran_tweet ...,


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4838 entries, 0 to 4837
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          4838 non-null   int64  
 1   tweet_id       4838 non-null   int64  
 2   username       4838 non-null   object 
 3   tweet_content  4838 non-null   object 
 4   tweet_label    0 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 189.1+ KB


In [7]:
data.shape

(4838, 5)

This dataset is a merged file from:
- Data scraping from X (22 September 2025 to 25 September 2025) under the keyword of "MBG" and "Makan Bergizi Gratis" (top & latest tweets).
- Kaggle dataset, scraped from tweets about MBG (10 February 2025).

The dataset consists of 4838 rows and 5 columns. There is no null for column index, tweet_id, username, and tweet_content. However, for the tweet_label, the value is still missing. This dataset will be labeled using IndoBERT model for tweets.

In [8]:
data.duplicated(subset='tweet_content').sum()

np.int64(0)

There is no duplicated data after the merging.

# Preprocessing
Before labeling the data using Indobert, the data needs to be preprocessed so the Indobert model can predict sentiment more accurately.

In [9]:
with open('slangs.json', 'r') as file:
	slangs = json.load(file)

In [10]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

stopwordsList = set(stopwords.words('indonesian'))

In each tweets/post, every links (http, www, https), mentions (@), hashtags (#), and other non-alphabetic characters (!, ?, ', etc) will be removed. Extra spaces will also be removed as well. Each word will be made as a token and applied with stemmer to turn it as the basic word form. After all steps are done, the words will be combined back as a sentence again.

In [11]:
def cleaning(data):
    temp = []

    for twt in data:
        twt = twt.casefold()
        twt = re.sub(r'http\S+|wwws+|https\S+', '', twt)
        twt = re.sub(r'@\w+|#\w+', '', twt)
        twt = re.sub(r'[\W]', ' ', twt)
        twt = re.sub(r'\s+', ' ', twt).strip()

        words = word_tokenize(twt)
        words = [slangs.get(word, word) for word in words]

        cleaned = []
        for word in words:
            twt = stemmer.stem(twt)
            if word not in stopwordsList:
                cleaned.append(word)

        temp.append(" ".join(cleaned))
        
    return temp

In [12]:
data['preprocessed_tweets'] = cleaning(data['tweet_content'])

Examples of preprocessed data

In [13]:
j = 0
for i in range(0+j, 10+j):
    print(i, data['preprocessed_tweets'].iloc[i])

0 menurutmu salah mbg
1 surat edaran proyek makan beracun gratis mbg poin 7 keracunan dan lain-lain mempublikasikannya bentuk pembungkaman
2 korban keracunan mbg berjatuhan ribuan siswa korban keracunan mbg pengelola dapurnya ditangkap pemilik dapurnya anggota dewan
3 mbg bukti paripurna jeleknya kualitas kompetensi elite penguasa indonesia sih mbg proyek krl pabrik elite membangun krl pabrik bodoh malas gampang bikin dapur
4 menteri keuangan purbaya yudhi sadewa menarik anggaran belanja kementerian lembaga terserap kementerian lembaga disisir anggaran program makan bergizi gratis mbg
5 daftar nama bertanggung proyek mbg menimbulkan korban jiwa
6 gila sih keracunan mbg mencapai ribuan gini satupun dipidana kejahatan dilindungi rezim
7 hentikan mbg evaluasi libatkan ahli masukan rakyat dijadikan program ujicoba berisiko tolong retwit didengar pemerintah
8 mbg investasi bangsa berita enak terdengar terbawa arus pesimisme manfaat mbg nyata memperkuat ketahanan pangan membuka lapangan kerj

There are still some tweets that are written in language other than Indonesian (including out-of-topic tweets that include the word 'MBG' but with different context), so these are dropped.

In [14]:
for i in range(2750, 2750):
    print(data.loc[i, 'tweet_content'])
    print()
print(data.loc[2743, 'tweet_content'])

Lebih menggugah selera daripada makan siang gratis prabowo


In [15]:
data = data.drop(index=[22, 23, 27, 39, 48, 1509, 2743]).reset_index(drop=True)

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4831 entries, 0 to 4830
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   index                4831 non-null   int64  
 1   tweet_id             4831 non-null   int64  
 2   username             4831 non-null   object 
 3   tweet_content        4831 non-null   object 
 4   tweet_label          0 non-null      float64
 5   preprocessed_tweets  4831 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 226.6+ KB


# IndoRoBERTa Labeling

The used IndoRoBERTa model instructed to do these preprocessing steps:
- Change all user mentions (e.g., @username) to @USER so the username does not affect the model's prediction.
- Change all links (e.g., http://ursite.com) to HTTPURL so the link name does not affect the model's prediction as well.
- Demojize the emoji as emoji is widely used in social media platforms, including X, so the emoji might give additional meaning/importance to the model's prediction (e.g., ðŸ˜­ to crying_face may add a hint that the tweet is sad/negative).

In [17]:
def preprocess_tweet(text):

	text = re.sub(r"@\w+", "@USER", text)
	text = re.sub(r"http\S+|www\S+", "HTTPURL", text)

	text = emoji.demojize(text, delimiters=(" ", " "))

	return text

In [18]:
os.environ["HF_HOME"] = "D:/huggingface_cache"

The IndoRoBERTa model is stored locally, imported from huggingface, so the cache needs to be stored somewhere. The directory can be changed depending on the device. To run all this code automatically, create an additional directory under the D folder with the name **huggingface_cache**.

In [19]:
pretrained_name = "w11wo/indonesian-roberta-base-sentiment-classifier"

nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name
)

Device set to use cpu


The pretrained model is imported from huggingface and the ID is stored inside the pretrained_name variable. It uses RoBERTa (robust version of BERT) and is trained specifically with Indonesian tweets data and it will classify the tweets into 3 labels: positive, neutral, and negative.

The pipeline function is a wrapper to break down texts into tokens (numbers) that can be understood by the model, and then feed the tokens into the pretrained RoBERTa model.

In [20]:
df = pd.read_csv("tweets_cleaned.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,index,tweet_id,username,tweet_content,tweet_label,preprocessed_tweets
0,0,0,1970083558544089111,Project Multatuli,Apa yang menurutmu salah dari MBG? https://t.c...,,menurutmu salah mbg
1,1,1,1967802981862277607,salam4jari,Surat Edaran Proyek Makan Beracun Gratis (MBG)...,,surat edaran proyek makan beracun gratis mbg p...
2,2,2,1968473097616638158,Beby Sweet,Korban keracunan MBG masih terus berjatuhan.\n...,,korban keracunan mbg berjatuhan ribuan siswa k...
3,3,3,1969929640853844095,zhil,MBG adalah bukti paripurna jeleknya kualitas d...,,mbg bukti paripurna jeleknya kualitas kompeten...
4,4,4,1969023678483693785,tempo.co,Menteri Keuangan Purbaya Yudhi Sadewa bakal me...,,menteri keuangan purbaya yudhi sadewa menarik ...


In [21]:
df = df[['preprocessed_tweets', 'tweet_content']]

In [22]:
df.isna().sum()

preprocessed_tweets    11
tweet_content           0
dtype: int64

After preprocessing, several tweets are missing because they do not fit the criteria for the preprocessing (e.g., consisting of only stopwords). These are removed because they likely do not impose any important meaning or significance to the model.

In [23]:
df = df.dropna()

In [24]:
df.shape

(4822, 2)

After removing the missing values, the dataset is now 4822 rows.

In [25]:
df.head()

Unnamed: 0,preprocessed_tweets,tweet_content
0,menurutmu salah mbg,Apa yang menurutmu salah dari MBG? https://t.c...
1,surat edaran proyek makan beracun gratis mbg p...,Surat Edaran Proyek Makan Beracun Gratis (MBG)...
2,korban keracunan mbg berjatuhan ribuan siswa k...,Korban keracunan MBG masih terus berjatuhan.\n...
3,mbg bukti paripurna jeleknya kualitas kompeten...,MBG adalah bukti paripurna jeleknya kualitas d...
4,menteri keuangan purbaya yudhi sadewa menarik ...,Menteri Keuangan Purbaya Yudhi Sadewa bakal me...


In [26]:
df["tweet_label"] = df["preprocessed_tweets"].apply(lambda x: nlp(x)[0]["label"])
df["confidence"] = df["preprocessed_tweets"].apply(lambda x: nlp(x)[0]["score"])

In [27]:
df.head()

Unnamed: 0,preprocessed_tweets,tweet_content,tweet_label,confidence
0,menurutmu salah mbg,Apa yang menurutmu salah dari MBG? https://t.c...,negative,0.999151
1,surat edaran proyek makan beracun gratis mbg p...,Surat Edaran Proyek Makan Beracun Gratis (MBG)...,negative,0.972376
2,korban keracunan mbg berjatuhan ribuan siswa k...,Korban keracunan MBG masih terus berjatuhan.\n...,neutral,0.996815
3,mbg bukti paripurna jeleknya kualitas kompeten...,MBG adalah bukti paripurna jeleknya kualitas d...,negative,0.999448
4,menteri keuangan purbaya yudhi sadewa menarik ...,Menteri Keuangan Purbaya Yudhi Sadewa bakal me...,neutral,0.998951


In [28]:
df.drop("tweet_content", axis=1)

Unnamed: 0,preprocessed_tweets,tweet_label,confidence
0,menurutmu salah mbg,negative,0.999151
1,surat edaran proyek makan beracun gratis mbg p...,negative,0.972376
2,korban keracunan mbg berjatuhan ribuan siswa k...,neutral,0.996815
3,mbg bukti paripurna jeleknya kualitas kompeten...,negative,0.999448
4,menteri keuangan purbaya yudhi sadewa menarik ...,neutral,0.998951
...,...,...,...
4828,berkat e tonggoku sing mari meninggal ae luwih...,positive,0.997661
4829,menunggu hasil foto laporan terkait program un...,neutral,0.998622
4830,anakmu keluarga gibran lainya makan jatah maka...,neutral,0.998356
4831,menu makan siang gratis anak sekolah 10 ribu ...,positive,0.999331


This is the final dataset that has been preprocessed, along with the tweet label and confidence level from the IndoRoBERTa model.

In [None]:
df.to_csv("tweets_labeled_indoroberta.csv", index=False)