
# **04 Programmatic Sentiment Labeling (Bilingual)**

This notebook performs programmatic sentiment labeling on a clean, unlabeled, bilingual (English & Indonesian) dataset. It employs a hybrid, language-specific approach:

- For English reviews, it uses VADER, a fast, lexicon-based tool.
- For Indonesian reviews, it uses a pre-trained Transformer model fine-tuned for Indonesian sentiment.

The input is the cleaned, tokenized review data (one CSV file per language). The output is a dataset labeled with a single scheme (positive, neutral, negative) and saved as one file per language.

## **1. Setup and Data Loading**

In [1]:
# Install necessary libraries
!pip install langdetect transformers[torch] tqdm
!pip install accelerate -qqq # Required by some Hugging Face models

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=defd2681538d870637317374915238a060c740a1e931fdac4d6894445201ce4a
  Stored in directory: /root/.cache/pip/wheels/c1/67/88/e844b5b022812e15a52e4eaa38a1e709e99f0

In [2]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
from tqdm.auto import tqdm
from langdetect import detect, DetectorFactory
from huggingface_hub import login
from google.colab import userdata
import os

# Ensure consistent language detection results
DetectorFactory.seed = 0

tqdm.pandas()

In [3]:
# Download the VADER lexicon
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [4]:
# reviews_cleaned_en = 'https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_en.csv'
# reviews_cleaned_id = 'https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_id.csv'

In [9]:
# Load the cleaned datasets separately
reviews_cleaned_en = 'https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_en_tokenized.csv'
reviews_cleaned_id = 'https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_id_tokenized.csv'

try:
    df_en = pd.read_csv(reviews_cleaned_en)
    print(f"English data loaded from {reviews_cleaned_en}.")
except Exception as e:
    print(f"Error loading English data: {e}")
    df_en = pd.DataFrame() # Create an empty DataFrame in case of error

try:
    df_id = pd.read_csv(reviews_cleaned_id)
    print(f"Indonesian data loaded from {reviews_cleaned_id}.")
except Exception as e:
    print(f"Error loading Indonesian data: {e}")
    df_id = pd.DataFrame() # Create an empty DataFrame in case of error

# Display the heads of the separate dataframes to confirm loading
print("\nEnglish DataFrame head:")
display(df_en.head())

print("\nIndonesian DataFrame head:")
display(df_id.head())

English data loaded from https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_en_tokenized.csv.
Indonesian data loaded from https://raw.githubusercontent.com/LatiefDataVisionary/data-science-capstone-project-college/refs/heads/main/data/processed/reviews_cleaned_id_tokenized.csv.

English DataFrame head:


Unnamed: 0,content,score,thumbsUpCount,cleaned_content
0,"they fixed it, I was just really pissy yesterd...",5,1,"['fixed', 'really', 'pissy', 'yesterday', 'cau..."
1,"Offline doesnt work, support doesnt help, just...",1,0,"['offline', 'doesnt', 'work', 'support', 'does..."
2,Super annoying ad experience! It feels like th...,1,5,"['super', 'annoying', 'ad', 'experience', 'fee..."
3,👍,5,1,[]
4,super song for everything,5,0,"['super', 'song', 'everything']"



Indonesian DataFrame head:


Unnamed: 0,content,score,thumbsUpCount,cleaned_content
0,lagu bukan hanya alunan nada tapi bisa jadi un...,1,2,"['lagu', 'alunan', 'nada', 'ungkapan', 'kebeba..."
1,iklan Mulu gak jelass apa apa harus premium ko...,1,0,"['iklan', 'mulu', 'tidak', 'jelass', 'premium'..."
2,Terima kasih banyak 🙏👍👍👍,5,0,"['terima', 'kasih']"
3,kok di aku mah gk bisa ada lirik ya sih tolong...,4,0,"['mah', 'gk', 'lirik', 'ya', 'sih', 'tolong', ..."
4,sangat banyak lagu nya,5,0,"['lagu', 'nya']"


## 2. Language Detection: The Core Logic Gate

The first step is to determine the language of each review. The language acts as a switch that routes the review to the correct labeling pipeline: English reviews go to VADER and Indonesian reviews go to the Transformer model. In this version of the notebook the cleaned data is already loaded from separate English and Indonesian files, so the `langdetect`-based detection cells in this section are kept commented out for reference.
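
The routing idea can be sketched as a small dispatch table. This is a minimal illustration rather than the notebook's actual code: `label_review`, `LABELERS`, and the two stub labelers are hypothetical names, and the stubs only stand in for the VADER and Transformer pipelines defined in the later sections.

```python
from typing import Callable, Dict, Optional


# Hypothetical stand-ins for the real labelers defined later in the notebook.
def stub_english_labeler(text: str) -> str:
    return "positive" if "good" in text.lower() else "neutral"


def stub_indonesian_labeler(text: str) -> str:
    return "positive" if "bagus" in text.lower() else "neutral"


# Map each language code to the labeler responsible for it.
LABELERS: Dict[str, Callable[[str], str]] = {
    "en": stub_english_labeler,
    "id": stub_indonesian_labeler,
}


def label_review(text: str, language: str) -> Optional[str]:
    """Route a review to the labeler for its language; None if unsupported."""
    labeler = LABELERS.get(language)
    if labeler is None:
        return None
    return labeler(text)


print(label_review("Really good app", "en"))
print(label_review("Aplikasi bagus", "id"))
print(label_review("Tres bien", "fr"))
```

Returning `None` for unsupported languages keeps those rows visible as unlabeled instead of forcing them through a pipeline that was not built for them.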

In [12]:
# # Function to detect language with error handling
# def detect_language(text):
#     try:
#         # Langdetect can struggle with very short texts
#         if pd.isna(text) or len(str(text).strip()) < 5:
#             return 'unknown'
#         return detect(str(text))
#     except:
#         return 'unknown' # Handle potential errors during detection

# # Apply language detection
# df['language'] = df['content'].progress_apply(detect_language)

In [13]:
# # Show the distribution of detected languages
# display(df['language'].value_counts())

## 3. Pipeline A - Labeling English Reviews with VADER

For English reviews, we will use VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and it is known for being fast and effective for English text. It provides a compound score ranging from -1 (most negative) to +1 (most positive).
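
To make the -1 to +1 range concrete: VADER adds up the rule-adjusted valence scores of the words in a text and then squashes that sum into the compound score. The sketch below reproduces only that normalization step, assuming the default `alpha = 15` used by the reference VADER implementation. `normalize_compound` is a name chosen here for illustration and is not an NLTK function.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Larger magnitudes approach -1 or +1 but never reach them.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+.1f} -> {normalize_compound(raw):+.4f}")
```

Because the curve is steep near zero, a short review with one mildly charged word can already move the compound score well away from 0, which matters when choosing the positive and negative cut-offs.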

In [14]:
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

In [15]:
# Function to label English sentiment using VADER compound score
def label_english_sentiment(text):
    if pd.isna(text):
        return None
    scores = sia.polarity_scores(str(text))
    compound_score = scores['compound']
    if compound_score >= 0.03:
        return 'positive'
    elif compound_score <= -0.02:
        return 'negative'
    else:
        return 'neutral'

In [16]:
# Apply VADER labeling to English reviews
# Initialize sentiment_label column to None in df_en
df_en['sentiment_label'] = None

# Apply the labeling function to English reviews in df_en
df_en['sentiment_label'] = df_en['content'].progress_apply(label_english_sentiment)

# Check how many English reviews were labeled
print(f"Number of English reviews labeled: {df_en['sentiment_label'].notna().sum()}")

# Display a few examples of English reviews and their VADER-generated labels
print("\nExamples of English reviews and their VADER labels:")
display(df_en[['content', 'sentiment_label']].head())


## 4. Pipeline B - Labeling Indonesian Reviews with a Transformer Model

For Indonesian reviews, a more sophisticated, context-aware model is needed as VADER is not designed for this language. We will use a pre-trained Transformer model that has been fine-tuned for Indonesian sentiment analysis. This type of model captures nuances in language more effectively than lexicon-based methods for complex languages.
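
A sequence-classification model of this kind emits one raw score (logit) per class. A softmax turns those scores into probabilities, and the most probable class becomes the label. The sketch below shows that last step in plain Python, without loading a model. The logits and the label names are made up for illustration; the real label order must be read from the model's configuration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits and label names, for illustration only.
example_labels = ["label_a", "label_b", "label_c"]
label, prob = pick_label([0.2, -1.3, 2.1], example_labels)
print(label, round(prob, 3))
```

Keeping the probability alongside the label makes it possible to flag low-confidence predictions for manual review later.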

In [None]:
# Retrieve the token from Colab Secrets
# Make sure the secret is named 'HF_TOKEN'
hf_token = userdata.get('HF_TOKEN')

# Log in to Hugging Face
login(token=hf_token)

print("Hugging Face login successful!")

In [None]:
# Define the model checkpoint for an Indonesian sentiment model
# Using "w11wo/indonesian-roberta-base-sentiment-classifier" with pipeline.
# If you encounter issues loading the model, ensure you are authenticated with Hugging Face
# (e.g., by setting a HF_TOKEN in Colab secrets) or try a different public model.
model_checkpoint = "w11wo/indonesian-roberta-base-sentiment-classifier"

# Load the sentiment analysis pipeline
try:
    sentiment_pipeline = pipeline("sentiment-analysis", model=model_checkpoint)
    print(f"Pipeline loaded successfully using model: {model_checkpoint}")
except Exception as e:
    print(f"Error loading pipeline with model {model_checkpoint}: {e}")
    sentiment_pipeline = None # Set to None if loading fails

# The pipeline reports whatever label names are stored in the model config.
# Print them once and compare with the model card before trusting any mapping.
if sentiment_pipeline is not None:
    print(f"Model label names: {sentiment_pipeline.model.config.id2label}")

# Fallback mapping for checkpoints that expose only generic names ('LABEL_0', ...).
# The order below is an assumption; verify it against the printed names.
id_label_mapping = {'LABEL_0': 'negative', 'LABEL_1': 'neutral', 'LABEL_2': 'positive'}

### Tutorial: Connecting a Hugging Face Token to Google Colab

To access certain Hugging Face models from Google Colab, especially models that require authentication, you need to make your Hugging Face access token available to the Colab environment. The safest way to do this is the **Colab Secrets** feature.

The steps are as follows:

1.  **Get an access token from Hugging Face:**
    *   Open the Hugging Face website: [huggingface.co](https://huggingface.co/)
    *   Log in to your Hugging Face account. If you do not have one, sign up first.
    *   After logging in, click **your profile picture** at the top right of the page.
    *   Choose **"Settings"** from the dropdown menu.
    *   In the left-hand navigation of the Settings page, click **"Access Tokens"**.
    *   Click the **"New token"** button to create a new token.
    *   Give the token a name (for example, `colab-access` or `my-project-token`). The name is only used to identify the token in your Hugging Face account.
    *   Choose the token role: the **"read"** role is enough for loading models. If you plan to upload anything, choose "write".
    *   Click the **"Generate token"** button.
    *   The token appears on screen. **Copy it immediately**, because you will not be able to view it again later. Store it somewhere safe temporarily if needed, but do not paste it directly into notebook code that will be shared.

2.  **Store the token in Google Colab Secrets:**
    *   Return to your Google Colab notebook.
    *   In the left sidebar of Colab, find and click the **key icon (🔒)**. This opens the Secrets panel.
    *   Click the **"+ New secret"** button.
    *   In the **"Name"** field, enter the name of the secret. It **must** match the name used in the code. This notebook uses **`HF_TOKEN`**, so type `HF_TOKEN`.
    *   In the **"Value"** field, paste the Hugging Face access token you just copied.
    *   Make sure the **"Notebook access"** toggle is enabled (green or checked). This allows the notebook to read the secret.
    *   (Optional) Add a description for the secret.
    *   Once you are done, the secret is saved automatically.

3.  **Use the token in Colab code to log in:**
    *   You can now use Python code in the notebook to read the `HF_TOKEN` secret and log in to Hugging Face. The code reads the token from Colab Secrets rather than from visible code, which is safer.

The Python code to run is the login cell at the start of this section (the cell that previously failed with `SecretNotFoundError`).
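
The token lookup in step 3 can also be written defensively, so that a missing secret does not stop the notebook. The sketch below is one possible approach, not part of the original notebook: `get_hf_token` is a hypothetical helper, and falling back to an environment variable with the same name is an assumption made here for running outside Colab.

```python
import os
from typing import Optional


def get_hf_token(secret_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token from Colab Secrets, else from the environment, else None."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(secret_name)
        except Exception:
            # Secret missing or notebook access not granted; fall through.
            pass

    # Assumed fallback: an environment variable with the same name.
    return os.environ.get(secret_name)


token = get_hf_token()
print("Token found." if token else "No token found.")
```

If a token is returned, it can be passed to `login(token=token)` exactly as in the login cell.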

In [None]:
# Load the tokenizer and model once, at module level
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Read the class names from the model config instead of hard-coding their order;
# a wrong order would silently swap labels. Print the mapping once to check it
# against the model card.
id_labels = model.config.id2label
print(f"Model label mapping: {id_labels}")

# Function to label Indonesian sentiment using the Transformer model
def label_indonesian_sentiment(text):
    if pd.isna(text):
        return None
    try:
        # Tokenize the text, truncating long reviews to 512 tokens
        inputs = tokenizer(str(text), return_tensors="pt", truncation=True, max_length=512)

        # Pass the tokens through the model without tracking gradients
        with torch.no_grad():
            outputs = model(**inputs)

        # Get the logits and apply softmax to get probabilities
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=1)[0]

        # Determine the predicted label (index with the highest probability)
        predicted_class_id = probabilities.argmax().item()

        # Return the corresponding label, lower-cased to match the VADER labels
        return str(id_labels[predicted_class_id]).lower()

    except Exception as e:
        # str() guards against non-string inputs when slicing for the message
        print(f"Error labeling text: {str(text)[:50]}... Error: {e}")
        return None # Return None if labeling fails

In [None]:
# Apply Transformer labeling to the Indonesian reviews
df_id['sentiment_label'] = df_id['content'].progress_apply(label_indonesian_sentiment)

# Check how many Indonesian reviews were labeled
print(f"Number of Indonesian reviews labeled: {df_id['sentiment_label'].notna().sum()}")

## 5. Reviewing and Finalizing the Labeled Dataset

In [None]:
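# Combine both labeled DataFrames, recording each review's language from its source file
df = pd.concat([df_en.assign(language='en'), df_id.assign(language='id')], ignore_index=True)

# Check for any reviews that were not labeled
unlabeled_count = df['sentiment_label'].isnull().sum()
print(f"Number of reviews that were not labeled: {unlabeled_count}")
if unlabeled_count > 0:
    print("Check whether the review text was missing or whether any errors occurred during labeling.")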

In [None]:
# Display the final distribution of the combined labels
display(df['sentiment_label'].value_counts())

In [None]:
# Display a few examples of English reviews and their VADER-generated labels
print("Examples of English reviews and their VADER labels:")
display(df[df['language'] == 'en'][['content', 'sentiment_label']].head())

In [None]:
# Display a few examples of Indonesian reviews and their Transformer-generated labels
print("\nExamples of Indonesian reviews and their Transformer labels:")
display(df[df['language'] == 'id'][['content', 'sentiment_label']].head())

## 6. Saving the Labeled Dataset

In [None]:
# Define the output directory and filenames
output_dir = '../data/processed/'
# output_path_en = os.path.join(output_dir, 'reviews_labeled_en.csv')
# output_path_id = os.path.join(output_dir, 'reviews_labeled_id.csv')
output_path_en = os.path.join(output_dir, 'reviews_labeled_en_tokenized.csv')
output_path_id = os.path.join(output_dir, 'reviews_labeled_id_tokenized.csv')

output_path_other = os.path.join(output_dir, 'reviews_labeled_other_languages.csv') # To save other languages

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
print(f"Output directory '{output_dir}' ensured to exist.")

# Filter the DataFrame by language
df_en_labeled = df[df['language'] == 'en']
df_id_labeled = df[df['language'] == 'id']
df_other_languages = df[~df['language'].isin(['en', 'id'])]


# Save the labeled DataFrames to separate CSV files
try:
    df_en_labeled.to_csv(output_path_en, index=False)
    print(f"Labeled English dataset saved to {output_path_en}")
except Exception as e:
    print(f"Error saving English dataset: {e}")

try:
    df_id_labeled.to_csv(output_path_id, index=False)
    print(f"Labeled Indonesian dataset saved to {output_path_id}")
except Exception as e:
    print(f"Error saving Indonesian dataset: {e}")

try:
    df_other_languages.to_csv(output_path_other, index=False)
    print(f"Labeled other languages dataset saved to {output_path_other}")
except Exception as e:
    print(f"Error saving other languages dataset: {e}")

## Conclusion

This notebook implements a hybrid, language-specific approach to sentiment labeling on a bilingual dataset. Each review is routed by language to the tool best suited to it (VADER for English, a Transformer model for Indonesian), and both pipelines emit the same positive, neutral, and negative labels. The result is a consistently labeled dataset ready for further analysis or model training.