# üáªüá≥ VeryGoodMail - PhoBERT Email Classification Training

This notebook helps you train PhoBERT models for:
- **Spam Detection**: detect spam emails
- **Sentiment Analysis**: analyze sentiment
- **Category Classification**: categorize emails

© 2025 VeryGoodMail by Hoàn

## 1. Setup Environment

In [None]:
# Install required packages
!pip install transformers torch datasets scikit-learn pandas underthesea -q

In [None]:
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')
if device == 'cuda':
    print(f'GPU: {torch.cuda.get_device_name(0)}')

## 2. Load PhoBERT Tokenizer

In [None]:
# Load PhoBERT tokenizer
model_name = "vinai/phobert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f'Loaded tokenizer: {model_name}')
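
PhoBERT was pre-trained on word-segmented Vietnamese, where the syllables of a multi-syllable word are joined with underscores, so raw text is usually segmented before it reaches the tokenizer. The setup cell installs `underthesea` but never uses it. The sketch below assumes that its `word_tokenize(text, format="text")` is an acceptable segmenter here. `load_segmenter` and `segment_batch` are hypothetical helper names, and the identity fallback exists only so the cell still runs when the package is missing.

```python
def load_segmenter():
    """Return a text -> text function that joins multi-syllable words with '_'."""
    try:
        # word_tokenize(..., format="text") returns a single string in which
        # the syllables of one word are joined by underscores.
        from underthesea import word_tokenize
    except ImportError:
        # Assumption: fall back to unsegmented text when underthesea is absent.
        return lambda text: text
    return lambda text: word_tokenize(text, format="text")


segment = load_segmenter()


def segment_batch(examples):
    """Batched map function: segment the 'text' column before tokenization."""
    return {"text": [segment(t) for t in examples["text"]]}


print(segment_batch({"text": ["Cuộc họp vào 3 giờ chiều"]}))
```

If this is adopted, the same `segment` function has to be applied in both places: before `tokenize_function` during training (for example with `dataset.map(segment_batch, batched=True)`) and to the input text in `test_model`. Otherwise training and inference see different formats.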

## 3. Prepare Dataset

Upload your own dataset or use the sample data.

In [None]:
# Sample data - replace with your own dataset
# Format: text, label

# Spam detection dataset
spam_data = [
    # Spam examples (label=1)
    ("Chúc mừng! Bạn đã trúng thưởng 100 triệu. Click ngay!", 1),
    ("Kiếm tiền online dễ dàng, thu nhập 50tr/tháng", 1),
    ("Giảm cân nhanh chóng không cần tập luyện", 1),
    ("Free gift! Claim your prize now!", 1),
    ("Khuyến mãi đặc biệt chỉ hôm nay, giảm 90%!", 1),
    ("You have won a lottery! Click here!", 1),
    # Ham examples (label=0)
    ("Cuộc họp được lên lịch vào ngày mai lúc 10 giờ", 0),
    ("Vui lòng xem xét tài liệu đính kèm", 0),
    ("Cảm ơn email của bạn về dự án", 0),
    ("Meeting scheduled for tomorrow at 10 AM", 0),
    ("Please review the attached document", 0),
    ("Thank you for your email regarding the project", 0),
]

# Sentiment dataset
sentiment_data = [
    # Positive (label=2)
    ("Cảm ơn bạn rất nhiều! Dịch vụ tuyệt vời!", 2),
    ("Tôi rất hài lòng với sản phẩm này", 2),
    ("Great job! Thanks for your help!", 2),
    # Neutral (label=1)
    ("Tôi muốn hỏi về đơn hàng của mình", 1),
    ("Xin cho tôi biết thêm thông tin", 1),
    ("I would like to inquire about my order", 1),
    # Negative (label=0)
    ("Dịch vụ rất tệ, tôi rất thất vọng", 0),
    ("Sản phẩm bị lỗi, yêu cầu hoàn tiền", 0),
    ("This is terrible service. I want a refund.", 0),
]

# Category dataset
category_data = [
    # Primary (label=0)
    ("Cuộc họp vào lúc 3 giờ chiều nay", 0),
    ("Please send me the report by EOD", 0),
    # Important (label=1)
    ("KHẨN CẤP: Cần phản hồi ngay lập tức", 1),
    ("URGENT: Your account needs verification", 1),
    # Social (label=2)
    ("Ai đó đã thích bài viết của bạn", 2),
    ("You have a new friend request", 2),
    # Promotions (label=3)
    ("Giảm giá 50% tất cả sản phẩm", 3),
    ("Summer sale - 50% off everything!", 3),
    # Updates (label=4)
    ("Đơn hàng của bạn đã được giao", 4),
    ("Your package has been delivered", 4),
]

# Convert to DataFrames
df_spam = pd.DataFrame(spam_data, columns=['text', 'label'])
df_sentiment = pd.DataFrame(sentiment_data, columns=['text', 'label'])
df_category = pd.DataFrame(category_data, columns=['text', 'label'])

print(f'Spam dataset: {len(df_spam)} samples')
print(f'Sentiment dataset: {len(df_sentiment)} samples')
print(f'Category dataset: {len(df_category)} samples')
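
The sample sets hold only a handful of examples per class, so a random 80/20 split can easily leave a class out of the training or evaluation portion. A quick per-label count before training makes that visible. This is a minimal sketch: `label_counts` and `underrepresented` are hypothetical helper names, and the threshold of 5 examples per class is an arbitrary assumption, not a recommendation from the notebook.

```python
from collections import Counter


def label_counts(rows):
    """Count examples per label in a list of (text, label) pairs."""
    return Counter(label for _, label in rows)


def underrepresented(rows, min_per_class=5):
    """Return the labels that have fewer than min_per_class examples."""
    return sorted(l for l, n in label_counts(rows).items() if n < min_per_class)


demo = [("a", 0), ("b", 0), ("c", 1)]
print(label_counts(demo))
print(underrepresented(demo, min_per_class=2))
```

The same functions accept `spam_data`, `sentiment_data`, and `category_data` directly, since each is a list of `(text, label)` tuples.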

In [None]:
# Upload your own dataset (optional)
# from google.colab import files
# uploaded = files.upload()
# df_spam = pd.read_csv('your_spam_data.csv')
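
When loading your own file, it helps to check the `text, label` format before training, because an out-of-range label only fails much later inside the loss computation. The sketch below uses only the standard library. `read_labeled_rows` is a hypothetical helper, and it assumes the CSV has a header row with `text` and `label` columns and integer labels from `0` to `num_labels - 1`.

```python
import csv
import io


def read_labeled_rows(fileobj, num_labels):
    """Read (text, label) pairs from a CSV that has 'text' and 'label' columns."""
    reader = csv.DictReader(fileobj)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    # Row numbers are approximate if a quoted field spans several lines.
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        text = (row["text"] or "").strip()
        label = int(row["label"] or "")  # ValueError on empty or non-integer labels
        if not text:
            raise ValueError(f"line {line_no}: empty text")
        if not 0 <= label < num_labels:
            raise ValueError(f"line {line_no}: label {label} outside 0..{num_labels - 1}")
        rows.append((text, label))
    return rows


sample = io.StringIO("text,label\nMeeting at 10 AM,0\nClaim your prize now!,1\n")
print(read_labeled_rows(sample, num_labels=2))
```

The returned list can be turned into a DataFrame with `pd.DataFrame(rows, columns=['text', 'label'])`, the same way the sample data is converted.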

## 4. Tokenize Data

In [None]:
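def tokenize_function(examples):
    # No padding here: DataCollatorWithPadding (used in train_model) pads each
    # batch to its longest sequence, which avoids padding every email to 256 tokens.
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=256
    )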

def prepare_dataset(df):
    """Convert DataFrame to HuggingFace Dataset"""
    dataset = Dataset.from_pandas(df)
    tokenized = dataset.map(tokenize_function, batched=True)
    return tokenized

# Prepare datasets
spam_dataset = prepare_dataset(df_spam)
sentiment_dataset = prepare_dataset(df_sentiment)
category_dataset = prepare_dataset(df_category)

print('Datasets prepared!')
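
Padding every example to `max_length=256` costs far more tokens than padding each batch only to its longest member, which is what `DataCollatorWithPadding` does when the tokenizer itself does not pad. The toy sketch below illustrates the difference on plain lists of token ids. `pad_batch` is a hypothetical helper, and `PAD_ID = 1` is an assumed stand-in for `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id, length=None):
    """Right-pad token-id lists to `length`, or to the longest sequence if None.

    Sequences longer than `length` are left as they are (no truncation here).
    """
    target = length if length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]


PAD_ID = 1  # assumption: stand-in for tokenizer.pad_token_id
batch = [[0, 11, 12, 2], [0, 21, 22, 23, 24, 2], [0, 31, 2]]

fixed = pad_batch(batch, PAD_ID, length=256)   # every row padded to 256
dynamic = pad_batch(batch, PAD_ID)             # rows padded to the longest (6)
print(sum(map(len, fixed)), sum(map(len, dynamic)))
```

For short emails the per-batch variant processes a small fraction of the tokens, which mostly shows up as faster training steps and lower memory use.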

## 5. Train Spam Detection Model

In [None]:
def train_model(dataset, num_labels, output_dir, epochs=3):
    """Train a PhoBERT classification model"""
    
    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels
    )
    
    # Split dataset (fixed seed so the train/eval split is reproducible)
    split = dataset.train_test_split(test_size=0.2, seed=42)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=100,
        weight_decay=0.01,
        logging_dir=f'{output_dir}/logs',
        logging_steps=10,
        eval_strategy='epoch',
        save_strategy='epoch',
        load_best_model_at_end=True,
    )
    
    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    
    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=split['train'],
        eval_dataset=split['test'],
        data_collator=data_collator,
    )
    
    # Train
    trainer.train()
    
    # Save model
    model.save_pretrained(output_dir)
    
    return model, trainer

# Train spam model (2 classes: ham=0, spam=1)
print('Training Spam Detection Model...')
spam_model, spam_trainer = train_model(
    spam_dataset, 
    num_labels=2, 
    output_dir='./spam_model',
    epochs=3
)
print('Spam model trained!')
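
With the sample data, `warmup_steps=100` is larger than the whole training run, so the learning rate never leaves the warmup phase. The arithmetic below makes this concrete. It is a sketch under stated assumptions: a single device, no gradient accumulation, a test share rounded up to a whole example, and the default linear warmup. `total_training_steps` and `max_lr_fraction` are hypothetical helper names.

```python
import math


def total_training_steps(num_examples, test_size=0.2, batch_size=8, epochs=3):
    """Optimizer steps for one run (one device, no gradient accumulation)."""
    n_test = math.ceil(num_examples * test_size)  # assumption: test share rounded up
    n_train = num_examples - n_test
    return math.ceil(n_train / batch_size) * epochs


def max_lr_fraction(total_steps, warmup_steps=100):
    """Upper bound on the fraction of the configured learning rate that is reached."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, total_steps / warmup_steps)


for name, n in [("spam", 12), ("sentiment", 9), ("category", 10)]:
    steps = total_training_steps(n)
    print(name, steps, max_lr_fraction(steps))
```

For the 12-sample spam set this gives 9 training examples, 2 steps per epoch, and 6 steps in total, so at most about 6% of the configured learning rate is ever applied. Lowering `warmup_steps` below the computed total, or training on a realistically sized dataset, avoids this.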

## 6. Train Sentiment Analysis Model

In [None]:
# Train sentiment model (3 classes: negative=0, neutral=1, positive=2)
print('Training Sentiment Analysis Model...')
sentiment_model, sentiment_trainer = train_model(
    sentiment_dataset,
    num_labels=3,
    output_dir='./sentiment_model',
    epochs=3
)
print('Sentiment model trained!')
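
`train_model` evaluates after every epoch, but without a `compute_metrics` callback the `Trainer` reports only the evaluation loss. The sketch below shows one possible callback that adds accuracy and macro-F1. It uses NumPy only, and it averages F1 over the classes present in the evaluation labels, which is a simplifying assumption and can differ from scikit-learn's default when a class is predicted but never appears in the labels.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy and macro-F1 from (logits, labels), as passed by Trainer."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    f1_scores = []
    for c in np.unique(labels):  # only classes that occur in the labels
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {
        "accuracy": float(np.mean(preds == labels)),
        "macro_f1": float(np.mean(f1_scores)),
    }


demo_logits = np.array([[2.0, 0.1], [0.2, 1.5], [3.0, 0.0], [0.9, 0.1]])
demo_labels = np.array([0, 1, 1, 0])
print(compute_metrics((demo_logits, demo_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer` inside `train_model`. The metrics then appear in the per-epoch evaluation logs with an `eval_` prefix.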

## 7. Train Category Classification Model

In [None]:
# Train category model (5 classes)
print('Training Category Classification Model...')
category_model, category_trainer = train_model(
    category_dataset,
    num_labels=5,
    output_dir='./category_model',
    epochs=3
)
print('Category model trained!')

## 8. Save Tokenizer

In [None]:
# Save tokenizer
tokenizer.save_pretrained('./tokenizer')
print('Tokenizer saved!')
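
Before the models are zipped, it can be worth confirming that every output directory was actually written. The sketch below checks for `config.json`, which `save_pretrained` writes next to the model weights, and for a non-empty tokenizer directory. `check_export` is a hypothetical helper, and the directory names are the ones this notebook uses.

```python
from pathlib import Path


def check_export(base_dir, model_dirs, tokenizer_dir="tokenizer"):
    """Return a list of problems found in the export (empty list if none)."""
    base = Path(base_dir)
    problems = []
    for name in model_dirs:
        # Each saved model directory should contain its config.json.
        if not (base / name / "config.json").is_file():
            problems.append(f"{name}: config.json not found")
    tok = base / tokenizer_dir
    if not tok.is_dir() or not any(tok.iterdir()):
        problems.append(f"{tokenizer_dir}: missing or empty")
    return problems


print(check_export(".", ["spam_model", "sentiment_model", "category_model"]))
```

An empty list means all four directories look complete. Any message in the list names the directory that needs to be re-created.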

## 9. Test Models

In [None]:
def test_model(model, text, label_map):
    """Test a single prediction"""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=256)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)
    model.eval()
    
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
        pred_class = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][pred_class].item()
    
    return label_map[pred_class], confidence

# Test spam detection
spam_labels = {0: 'Ham', 1: 'Spam'}
test_texts = [
    "Cuộc họp vào 3 giờ chiều",
    "Bạn đã trúng thưởng 1 tỷ đồng!",
]
print('\n=== Spam Detection Test ===')
for text in test_texts:
    label, conf = test_model(spam_model, text, spam_labels)
    print(f'Text: "{text}"')
    print(f'Prediction: {label} (confidence: {conf:.2%})\n')

# Test sentiment
sentiment_labels = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
test_texts = [
    "Cảm ơn bạn rất nhiều!",
    "Dịch vụ rất tệ",
]
print('=== Sentiment Analysis Test ===')
for text in test_texts:
    label, conf = test_model(sentiment_model, text, sentiment_labels)
    print(f'Text: "{text}"')
    print(f'Prediction: {label} (confidence: {conf:.2%})\n')
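
The confidence printed by `test_model` is the softmax probability of the predicted class. The pure-Python sketch below repeats that arithmetic on a hand-written logit vector, so the relationship between logits, probabilities, and the reported confidence is easy to inspect. `predict_from_logits` is a hypothetical helper and the logits are made-up values, not model output.

```python
import math


def predict_from_logits(logits, label_map):
    """Return (label, confidence) using a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return label_map[pred], probs[pred]


label, conf = predict_from_logits([0.3, 2.1], {0: 'Ham', 1: 'Spam'})
print(f'Prediction: {label} (confidence: {conf:.2%})')
```

With models trained on only a few samples, these probabilities tend to sit close to uniform (about 50% for two classes), so a low confidence here reflects the tiny dataset rather than a problem in the inference code.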

## 10. Download Models

In [None]:
# Zip and download all models
!zip -r models.zip spam_model sentiment_model category_model tokenizer

from google.colab import files
files.download('models.zip')

print('\n✅ Download complete!')
print('Extract models.zip and copy to PhoBERT-Service/models/ directory')

## 📝 Next Steps

1. Download the `models.zip` file
2. Extract it into the `PhoBERT-Service/models/` directory
3. Directory structure:
   ```
   PhoBERT-Service/models/
   ├── spam_model/
   ├── sentiment_model/
   ├── category_model/
   └── tokenizer/
   ```
4. Run the PhoBERT service:
   ```bash
   cd PhoBERT-Service
   pip install -r requirements.txt
   uvicorn app.main:app --host 0.0.0.0 --port 8000
   ```
5. Update the Email-System-Server `.env`:
   ```
   PHOBERT_URL=http://localhost:8000
   ```