# 03. Labeling
In this notebook, sentiment labels are assigned automatically using a pretrained model.

### Pretrained Model: [IndoBERT](https://huggingface.co/mdhugol/indonesia-bert-sentiment-classification)
We use the `mdhugol/indonesia-bert-sentiment-classification` model, which is fine-tuned for Indonesian sentiment classification. It assigns each text one of three labels:
* **LABEL_0**: Positive
* **LABEL_1**: Neutral
* **LABEL_2**: Negative
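
As a small illustration of the mapping above, the sketch below translates one raw prediction into a readable sentiment name. The prediction dict is hypothetical: it only imitates the `label`/`score` shape that a text-classification pipeline returns, and no model is loaded here.

```python
# Raw model labels and the sentiment classes listed above.
LABEL_MAP = {
    "LABEL_0": "positive",
    "LABEL_1": "neutral",
    "LABEL_2": "negative",
}

def to_sentiment(prediction):
    """Translate one pipeline prediction dict into a readable sentiment name."""
    return LABEL_MAP[prediction["label"]]

# Hypothetical prediction, shaped like a text-classification pipeline result.
example_prediction = {"label": "LABEL_2", "score": 0.97}
print(to_sentiment(example_prediction))  # negative
```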

### Validation
The automatic labels were manually validated, using the `full_text_original` column as a reference. Note that the data in this repository is for **demonstration purposes** and differs from the actual research dataset.
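
One way such a manual check could be summarized is an agreement rate between the automatic and the manually assigned labels. The sketch below is an assumption about how that comparison might look, not the procedure used in the research: the function name `agreement_report` and the example labels are made up for illustration.

```python
from collections import Counter

def agreement_report(auto_labels, manual_labels):
    """Return the share of matching labels and a count of each disagreement."""
    if len(auto_labels) != len(manual_labels):
        raise ValueError("Both label lists must have the same length.")
    if not auto_labels:
        raise ValueError("At least one labeled row is required.")

    pairs = list(zip(auto_labels, manual_labels))
    matches = sum(1 for auto, manual in pairs if auto == manual)
    # Count (automatic, manual) pairs that differ, to see which classes get confused.
    disagreements = Counter(pair for pair in pairs if pair[0] != pair[1])
    return matches / len(pairs), disagreements

# Made-up labels for illustration only; not taken from the dataset.
auto = ["positive", "neutral", "negative", "neutral"]
manual = ["positive", "negative", "negative", "neutral"]
rate, disagreements = agreement_report(auto, manual)
print(f"Agreement: {rate:.0%}")
print(disagreements.most_common())
```

A low agreement rate, or one disagreement pair that dominates the count, would suggest reviewing that class more closely before relying on the automatic labels.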

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

## Configuration

In [None]:
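# Hugging Face model ID of the pretrained Indonesian sentiment classifier.
PRETRAINED_MODEL = "mdhugol/indonesia-bert-sentiment-classification"
model = AutoModelForSequenceClassification.from_pretrained(PRETRAINED_MODEL)
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL)
# Wrap model and tokenizer in a pipeline that takes raw text and returns a label with a score.
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)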

## Load Dataset

In [None]:
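# Directory that holds the input and output CSV files.
FILE_PATH = '../data/'
df = pd.read_csv(FILE_PATH + 'processed_sample.csv', sep=';')

# Keep only the columns needed for labeling and manual validation.
df = df[['created_at', 'full_text_original', 'full_text']].copy()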

## Labeling

In [None]:
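# Map the model's raw output labels to readable sentiment names.
LABEL_MAP = {
    'LABEL_0': 'positive',
    'LABEL_1': 'neutral',
    'LABEL_2': 'negative'
}

def get_sentiment_label(text):
    """Return 'positive', 'neutral', or 'negative' for a single text."""
    # Missing or empty text is labeled neutral instead of being sent to the model.
    if not isinstance(text, str) or text.strip() == "":
        return "neutral"

    # Truncate long inputs so they stay within the model's 512-token limit.
    result = sentiment_analysis(text, truncation=True, max_length=512)[0]
    return LABEL_MAP.get(result['label'], 'neutral')

df['label'] = df['full_text'].apply(get_sentiment_label)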

## Visualization

In [None]:
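def plot_sentiment_distribution(dataframe):
    """Draw a pie chart showing the share of each sentiment label."""
    # value_counts() sorts in descending order, so index 0 is the most frequent label.
    counts = dataframe['label'].value_counts()
    labels = counts.index
    sizes = counts.values
    
    # Pull the largest slice slightly out of the pie to highlight it.
    explode = tuple(0.1 if i == 0 else 0 for i in range(len(labels)))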
    
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.pie(x=sizes, labels=labels, autopct='%1.1f%%', 
           explode=explode, startangle=140, 
           textprops={'fontsize': 12})
    
    ax.set_title('Sentiment Polarity', fontsize=16, pad=20)
    plt.tight_layout()
    plt.show()

plot_sentiment_distribution(df)

In [None]:
df.to_csv(FILE_PATH + 'labeled_sample.csv', index=False, sep=';')
df.head(10)