# Sentiment Analysis: Thai Text Classification

## Objective
Build a sentiment analysis system that classifies reviews as **positive**, **neutral**, or **negative**.

## Approach
Use a **pre-trained model** from Hugging Face that has already been fine-tuned for Thai:
- **Model**: WangchanBERTa (fine-tuned for sentiment analysis)
- **Dataset**: Wisesight Sentiment Corpus (Thai social media text)

## Evaluation Metrics
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix

## 1. Install and Import Libraries

In [3]:
# Install required packages (uncomment if needed)
# !pip install transformers datasets torch scikit-learn matplotlib seaborn

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Hugging Face libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Scikit-learn for evaluation
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix
)

print("Libraries imported successfully!")

ModuleNotFoundError: No module named 'matplotlib'

## 2. Load Dataset

Uses the **Wisesight Sentiment Corpus**, a collection of Thai social media messages with the following labels:
- `pos` (positive)
- `neu` (neutral)
- `neg` (negative)
- `q` (question), which will be filtered out

In [None]:
# Load Wisesight Sentiment dataset from Hugging Face
print("Loading Wisesight Sentiment dataset...")
dataset = load_dataset("wisesight_sentiment")

print(f"\nDataset structure:")
print(dataset)

print(f"\nTrain set size: {len(dataset['train'])}")
print(f"Validation set size: {len(dataset['validation'])}")
print(f"Test set size: {len(dataset['test'])}")

In [None]:
# Explore the dataset
print("Sample data from training set:")
print("=" * 60)

# Label mapping
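# Read the class names from the dataset metadata so the id -> name order cannot be mismatched
# (assumes 'category' is a ClassLabel feature with names such as pos, neu, neg, q)
label_names = dataset['train'].features['category'].names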

for i in range(5):
    sample = dataset['train'][i]
    print(f"\nText: {sample['texts'][:100]}...")
    print(f"Label: {label_names[sample['category']]}")
    print("-" * 40)

In [None]:
# Check label distribution
train_labels = [label_names[x['category']] for x in dataset['train']]
test_labels = [label_names[x['category']] for x in dataset['test']]

print("Label distribution in training set:")
print(pd.Series(train_labels).value_counts())

print("\nLabel distribution in test set:")
print(pd.Series(test_labels).value_counts())

## 3. Data Preprocessing

Keep only 3 classes: positive, neutral, and negative (the question class is excluded).

In [None]:
def filter_and_prepare_data(dataset_split):
    """
    Filter out 'question' category and prepare data for evaluation.
    
    Args:
        dataset_split: HuggingFace dataset split
    
    Returns:
        texts: List of text strings
        labels: List of label strings (negative, neutral, positive)
    """
    texts = []
    labels = []
    
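    # Take the id -> short-name order from the dataset metadata instead of hard-coding it
    # (assumes 'category' is a ClassLabel feature with names pos, neu, neg, q)
    short_names = dataset_split.features['category'].names
    full_names = {'pos': 'positive', 'neu': 'neutral', 'neg': 'negative', 'q': 'question'}
    
    for item in dataset_split:
        label = full_names[short_names[item['category']]]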
        # Skip question category
        if label != 'question':
            texts.append(item['texts'])
            labels.append(label)
    
    return texts, labels


# Prepare test data (filter out questions)
test_texts, test_labels = filter_and_prepare_data(dataset['test'])

print(f"Test samples after filtering: {len(test_texts)}")
print(f"\nLabel distribution after filtering:")
print(pd.Series(test_labels).value_counts())

## 4. Load Pre-trained Model

Uses **WangchanBERTa** fine-tuned for sentiment analysis on the Wisesight dataset.

This model reaches an accuracy of about 91% on the Wisesight dataset.

In [None]:
# Load pre-trained sentiment model
MODEL_NAME = "poom-sci/WangchanBERTa-finetuned-sentiment"

print(f"Loading model: {MODEL_NAME}")
print("This may take a minute...")

# Create sentiment analysis pipeline
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model=MODEL_NAME,
    tokenizer=MODEL_NAME,
    max_length=512,
    truncation=True
)

print("Model loaded successfully!")

In [None]:
# Test the model with sample texts
sample_texts = [
    "อาหารอร่อยมาก บริการดีเยี่ยม แนะนำเลยครับ",
    "ร้านนี้ธรรมดามาก ไม่มีอะไรพิเศษ",
    "แย่มาก รอนานมาก อาหารไม่อร่อย ไม่มาอีกแล้ว",
    "ของดีราคาถูก คุ้มค่ามากๆ",
    "ผิดหวังมาก ไม่เหมือนในรูปเลย"
]

print("Testing model with sample texts:")
print("=" * 60)

for text in sample_texts:
    result = sentiment_analyzer(text)[0]
    print(f"\nText: {text}")
    print(f"Prediction: {result['label']} (confidence: {result['score']:.4f})")

## 5. Run Predictions on Test Set

Run predictions on the test set to evaluate the model's performance.

In [None]:
def predict_sentiment_batch(texts: List[str], batch_size: int = 32) -> List[str]:
    """
    Predict sentiment for a list of texts in batches.
    
    Args:
        texts: List of text strings
        batch_size: Number of texts to process at once
    
    Returns:
        List of predicted labels
    """
    predictions = []
    total = len(texts)
    
    for i in range(0, total, batch_size):
        batch = texts[i:i+batch_size]
        results = sentiment_analyzer(batch)
        
        for result in results:
            # Map model output to standard labels
            label = result['label'].lower()
            if 'pos' in label:
                predictions.append('positive')
            elif 'neg' in label:
                predictions.append('negative')
            else:
                predictions.append('neutral')
        
        # Progress update
        if (i + batch_size) % 100 == 0 or (i + batch_size) >= total:
            print(f"Progress: {min(i + batch_size, total)}/{total} samples processed")
    
    return predictions


# Run predictions on test set
print("Running predictions on test set...")
print("=" * 60)

predicted_labels = predict_sentiment_batch(test_texts)

print(f"\nPrediction complete! Total predictions: {len(predicted_labels)}")

## 6. Evaluation Metrics

Compute several metrics to evaluate performance. Precision, recall, and F1 are calculated per class and combined with a support-weighted average (`average='weighted'`):
- **Accuracy**: the fraction of predictions that are correct
- **Precision**: of the samples predicted as a class, the fraction that truly belong to it
- **Recall**: of the samples that truly belong to a class, the fraction the model finds
- **F1-score**: the harmonic mean of precision and recall
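
To make the weighted averaging concrete, here is a minimal pure-Python sketch on a six-sample toy set. The helper `weighted_scores` and the toy labels are illustrative assumptions, not part of the notebook; the evaluation itself relies on scikit-learn.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) using support-weighted averaging."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        p_c = tp / n_pred if n_pred else 0.0      # per-class precision
        r_c = tp / n_true                         # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total                   # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1


# Hypothetical toy labels, chosen only for illustration
y_true = ["positive", "positive", "neutral", "neutral", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "negative"]

acc, prec, rec, f1_weighted = weighted_scores(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1_weighted:.4f}")
```

With support weighting, the weighted recall always equals the accuracy, so the two numbers reported for the test set should match.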

In [None]:
# Calculate metrics
accuracy = accuracy_score(test_labels, predicted_labels)
precision = precision_score(test_labels, predicted_labels, average='weighted')
recall = recall_score(test_labels, predicted_labels, average='weighted')
f1 = f1_score(test_labels, predicted_labels, average='weighted')

print("=" * 60)
print("EVALUATION METRICS")
print("=" * 60)
print(f"\nAccuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")

In [None]:
# Detailed classification report
print("\n" + "=" * 60)
print("CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(test_labels, predicted_labels, digits=4))

## 7. Confusion Matrix

Plot the confusion matrix to inspect the predictions in detail.

In [None]:
# Create confusion matrix
labels_order = ['negative', 'neutral', 'positive']
cm = confusion_matrix(test_labels, predicted_labels, labels=labels_order)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(
    cm, 
    annot=True, 
    fmt='d', 
    cmap='Blues',
    xticklabels=labels_order,
    yticklabels=labels_order
)
plt.title('Confusion Matrix - Sentiment Analysis', fontsize=14)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()

print("Confusion matrix saved to: confusion_matrix.png")

In [None]:
# Normalized confusion matrix (percentage)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(10, 8))
sns.heatmap(
    cm_normalized, 
    annot=True, 
    fmt='.2%', 
    cmap='Greens',
    xticklabels=labels_order,
    yticklabels=labels_order
)
plt.title('Normalized Confusion Matrix (Percentage)', fontsize=14)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.savefig('confusion_matrix_normalized.png', dpi=150)
plt.show()
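
In a row-normalized confusion matrix, each diagonal entry is that class's recall. The sketch below shows this on a made-up 3x3 matrix in plain Python; `row_normalize` and `toy_cm` are hypothetical names and values for illustration. Unlike a bare division, it returns zeros for a class with no true samples instead of NaN.

```python
def row_normalize(matrix):
    """Divide every row by its sum; rows that sum to zero stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


labels = ["negative", "neutral", "positive"]

# Hypothetical counts: rows are true labels, columns are predicted labels
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 4],
]

norm = row_normalize(toy_cm)

# The diagonal of the row-normalized matrix is the per-class recall
per_class_recall = {label: norm[i][i] for i, label in enumerate(labels)}
for label, value in per_class_recall.items():
    print(f"{label}: {value:.2%}")
```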

## 8. Error Analysis

Analyze the examples that the model misclassified.

In [None]:
# Find misclassified examples
errors = []
for i, (text, true_label, pred_label) in enumerate(zip(test_texts, test_labels, predicted_labels)):
    if true_label != pred_label:
        errors.append({
            'text': text[:100] + '...' if len(text) > 100 else text,
            'true_label': true_label,
            'predicted_label': pred_label
        })

print(f"Total misclassified samples: {len(errors)} / {len(test_texts)} ({len(errors)/len(test_texts)*100:.2f}%)")

# Show some error examples
print("\n" + "=" * 60)
print("SAMPLE MISCLASSIFICATIONS")
print("=" * 60)

for i, error in enumerate(errors[:10]):
    print(f"\n[{i+1}] Text: {error['text']}")
    print(f"    True: {error['true_label']} | Predicted: {error['predicted_label']}")
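
Beyond reading individual errors, it can help to count which (true, predicted) pairs occur most often. The following is a small sketch using only the standard library; `top_confusions` and the toy labels are assumptions for illustration. In the notebook, the same function could be called with `test_labels` and `predicted_labels`.

```python
from collections import Counter


def top_confusions(true_labels, predicted_labels, k=3):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    pairs = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return pairs.most_common(k)


# Hypothetical toy labels, chosen only for illustration
toy_true = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
toy_pred = ["negative", "negative", "neutral", "negative", "neutral", "positive"]

top = top_confusions(toy_true, toy_pred)
for (true_label, pred_label), count in top:
    print(f"{true_label} -> {pred_label}: {count}")
```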

## 9. Interactive Demo

Test the model on new text.

In [None]:
def analyze_sentiment(text: str) -> Dict:
    """
    Analyze sentiment of a single text.
    
    Args:
        text: Input text string
    
    Returns:
        Dictionary with sentiment and confidence
    """
    result = sentiment_analyzer(text)[0]
    
    # Map to standard labels
    label = result['label'].lower()
    if 'pos' in label:
        sentiment = 'positive'
    elif 'neg' in label:
        sentiment = 'negative'
    else:
        sentiment = 'neutral'
    
    return {
        'text': text,
        'sentiment': sentiment,
        'confidence': result['score']
    }


# Test with custom texts
custom_texts = [
    "สินค้าคุณภาพดีมาก ส่งเร็วมากๆ ประทับใจค่ะ",
    "ปกติธรรมดา ไม่ได้รู้สึกว่าดีหรือแย่",
    "ห่วยแตกมาก ไม่คุ้มราคาเลย อย่าซื้อ",
    "พนักงานน่ารักมาก ช่วยเหลือดี",
    "รสชาติพอใช้ได้ แต่ราคาแพงไป"
]

print("=" * 60)
print("SENTIMENT ANALYSIS DEMO")
print("=" * 60)

for text in custom_texts:
    result = analyze_sentiment(text)
    emoji = {'positive': '😊', 'neutral': '😐', 'negative': '😞'}[result['sentiment']]
    print(f"\n{emoji} [{result['sentiment'].upper()}] (conf: {result['confidence']:.2%})")
    print(f"   \"{text}\"")

## 10. Summary

Summary of the sentiment analysis test results.

In [None]:
print("=" * 60)
print("SENTIMENT ANALYSIS SUMMARY")
print("=" * 60)

print(f"\n📊 Dataset: Wisesight Sentiment Corpus")
print(f"🤖 Model: {MODEL_NAME}")
print(f"📝 Test samples: {len(test_texts)}")

print(f"\n📈 Performance Metrics:")
print(f"   • Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"   • Precision: {precision:.4f}")
print(f"   • Recall:    {recall:.4f}")
print(f"   • F1-score:  {f1:.4f}")

print(f"\n📁 Output files:")
print(f"   • confusion_matrix.png")
print(f"   • confusion_matrix_normalized.png")

print("\n" + "=" * 60)

In [None]:
# Create results DataFrame
results_df = pd.DataFrame({
    'text': test_texts,
    'true_label': test_labels,
    'predicted_label': predicted_labels
})

# Save to CSV
results_df.to_csv('sentiment_predictions.csv', index=False, encoding='utf-8-sig')
print("Predictions saved to: sentiment_predictions.csv")

# Display sample
results_df.head(10)