# Data Preparation Pipeline - Myanmar News Classification

This notebook walks through the complete data preparation pipeline for Myanmar news text classification. The pipeline transforms raw scraped articles into training-ready datasets through multiple stages of cleaning, preprocessing, and labeling.

## Pipeline Overview

The data preparation consists of four main stages:

1. **Data Cleaning**: Remove HTML, normalize Unicode, validate content and script, remove duplicates
2. **Preprocessing**: Sentence boundary detection, whitespace and punctuation standardization
3. **Tokenization**: Myanmar word segmentation and analysis
4. **Labeling & Dataset Creation**: Classification labels and train/val splits

Each stage addresses specific challenges in Myanmar NLP and creates progressively refined datasets.

## Stage 1: Data Cleaning

### Purpose
Remove HTML artifacts, normalize Unicode, and standardize text format from scraped content.

### Key Operations
- HTML tag removal and entity decoding
- Unicode normalization (NFC form)
- Content length validation
- Myanmar script validation
- Duplicate detection and removal
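
The operations above can be sketched with the Python standard library. This is a minimal illustration, not the notebook's actual implementation: the function names, the `min_chars` and `min_ratio` thresholds, and the exact-match duplicate check are all assumptions chosen for the example.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
MYANMAR_RE = re.compile(r"[\u1000-\u109F]")  # main Myanmar Unicode block


def clean_text(raw: str) -> str:
    """Strip tags, decode entities, normalize to NFC, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def myanmar_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are in the Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if MYANMAR_RE.match(c)) / len(chars)


def clean_articles(raw_articles, min_chars=50, min_ratio=0.5):
    """Clean, validate, and de-duplicate articles.

    min_chars and min_ratio are illustrative values, not the pipeline's.
    """
    seen = set()
    kept = []
    for raw in raw_articles:
        text = clean_text(raw)
        if len(text) < min_chars or myanmar_ratio(text) < min_ratio:
            continue  # too short or not predominantly Myanmar script
        if text in seen:
            continue  # exact duplicate after cleaning
        seen.add(text)
        kept.append(text)
    return kept
```

Tags are stripped before entities are decoded so that escaped markup such as `&lt;b&gt;` survives as literal text rather than being removed as a tag.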

### Input/Output
- **Input**: `data/scraped/raw_articles.csv` (2,847 articles)
- **Output**: `data/cleaned/cleaned_articles.csv` (2,653 articles)
- **Removed**: 194 articles (too short, invalid script, duplicates)

## Stage 2: Text Preprocessing

### Purpose
Standardize text structure and prepare for tokenization.

### Key Operations
- Sentence boundary detection
- Whitespace normalization
- Punctuation standardization
- Text segmentation validation
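
As a rough sketch of whitespace normalization and sentence boundary detection, the example below splits on the Myanmar section mark `။` (U+104B). Treating that mark as the only sentence terminator is an assumption, and the function names are hypothetical.

```python
import re

SECTION_MARK = "\u104B"  # MYANMAR SIGN SECTION, assumed here to end a sentence


def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list:
    """Split after each section mark, keeping the mark with its sentence."""
    parts = re.split(f"(?<={SECTION_MARK})", normalize_whitespace(text))
    return [p.strip() for p in parts if p.strip()]


def average_sentence_length(text: str) -> float:
    """Mean sentence length in characters; 0.0 for empty input."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)
```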

### Processing Results
- **Input**: 2,653 cleaned articles
- **Output**: 2,653 preprocessed articles
- **Quality metrics tracked**: Average sentence length, paragraph structure

## Stage 3: Tokenization

### Purpose
Segment Myanmar text into meaningful tokens for model input.

### Tokenization Approach
- **Method**: pyidaungsu Myanmar word segmenter
- **Token filtering**: Remove punctuation-only tokens
- **Length thresholds**: 10-500 tokens per article
- **Vocabulary analysis**: Token frequency and distribution
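
The filtering and length thresholds might look like the sketch below. The segmenter is passed in as a callable so the example runs without extra dependencies. `tokenize_article` and `is_punctuation_only` are hypothetical helpers, and treating the 10-500 range as inclusive is an assumption.

```python
import unicodedata


def is_punctuation_only(token: str) -> bool:
    """True if every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(c).startswith("P") for c in token
    )


def tokenize_article(text, segment, min_tokens=10, max_tokens=500):
    """Segment text, drop punctuation-only tokens, apply length thresholds.

    Returns the token list, or None if the article falls outside the range.
    """
    tokens = [
        t for t in segment(text) if t.strip() and not is_punctuation_only(t)
    ]
    if not (min_tokens <= len(tokens) <= max_tokens):
        return None
    return tokens


# Assumed usage with pyidaungsu (verify against the installed version):
#   import pyidaungsu as pds
#   tokens = tokenize_article(text, lambda t: pds.tokenize(t, form="word"))
```

Injecting the segmenter also makes it straightforward to compare alternative tokenizers against the same filtering rules.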

### Results
- **Vocabulary size**: 45,231 unique tokens
- **Average tokens per article**: 156
- **Token length distribution**: Well-balanced for model training

## Stage 4: Labeling & Dataset Creation

### Classification Schema
- **Neutral (0)**: General news, balanced reporting
- **Red (1)**: Government-positive, military-supportive content
- **Green (2)**: Opposition-positive, democracy-supportive content

### Labeling Process
- **Method**: Source-based automatic labeling
- **Validation**: Manual spot-checking of samples
- **Balance**: Sampling stratified by class so that each class contributes the same number of articles
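
A minimal sketch of source-based labeling followed by class balancing is shown below. The source names in `SOURCE_TO_CLASS` are placeholders, since this document does not list the actual sources, and the helper names are hypothetical.

```python
import random
from collections import defaultdict

LABELS = {"neutral": 0, "red": 1, "green": 2}

# Placeholder mapping: real source names are not given in this document.
SOURCE_TO_CLASS = {
    "source_a": "neutral",
    "source_b": "red",
    "source_c": "green",
}


def label_articles(articles):
    """Attach label_numeric based on each article's source; skip unknown sources."""
    labeled = []
    for art in articles:
        cls = SOURCE_TO_CLASS.get(art["source"])
        if cls is None:
            continue
        labeled.append({**art, "label_numeric": LABELS[cls]})
    return labeled


def balance(labeled, per_class, seed=42):
    """Randomly sample exactly per_class articles from each label."""
    by_label = defaultdict(list)
    for art in labeled:
        by_label[art["label_numeric"]].append(art)
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"label {label} has only {len(group)} articles")
        balanced.extend(rng.sample(group, per_class))
    return balanced
```

Because labels come from the source rather than the text, the manual spot-checks remain the only guard against articles that do not match their outlet's usual stance.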

### Final Dataset
- **Total articles**: 2,400 (800 per class)
- **Train split**: 1,920 articles (80%)
- **Validation split**: 480 articles (20%)
- **Features**: full_text, tokens, label_numeric, token_count
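
The 80/20 split can be sketched as a per-class shuffle and cut, which keeps class proportions identical in both splits. The function name, seed, and use of the standard library instead of a dedicated ML toolkit are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label_numeric", val_fraction=0.2, seed=42):
    """Split rows into (train, val), preserving the class ratio in each."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the caller's data is not shuffled
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

With 800 articles per class, each class contributes 160 validation and 640 training articles, which matches the 480 and 1,920 totals above.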

## Data Quality Metrics

### Text Quality
- **Language purity**: >95% Myanmar script
- **Content completeness**: Average 156 tokens per article
- **Duplication rate**: <1% after cleaning

### Label Distribution
- **Class balance**: 33.3% per class (800 articles each)
- **Source diversity**: Multiple news sources per class
- **Temporal coverage**: 6 months of articles

### Token Statistics
- **Vocabulary coverage**: 99.2% of validation tokens appear in the training vocabulary
- **OOV rate**: 0.8% on the validation set
- **Token density**: Consistent across all classes
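
One way to compute an out-of-vocabulary rate is shown below. It assumes the rate is measured per token occurrence in the validation set against the set of tokens seen in training. The document does not say whether its figures are per occurrence or per unique token, so treat this definition as an assumption.

```python
def oov_rate(train_docs, val_docs):
    """Share of validation token occurrences not present in the training vocabulary.

    Each doc is a list of tokens. Returns 0.0 when the validation set is empty.
    """
    vocab = {tok for doc in train_docs for tok in doc}
    total = sum(len(doc) for doc in val_docs)
    if total == 0:
        return 0.0
    unseen = sum(1 for doc in val_docs for tok in doc if tok not in vocab)
    return unseen / total


def vocabulary_coverage(train_docs, val_docs):
    """Complement of the OOV rate."""
    return 1.0 - oov_rate(train_docs, val_docs)
```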

## Pipeline Summary

The pipeline reduced 2,847 raw articles to 2,653 after cleaning, and then to a balanced dataset of 2,400 articles (800 per class), split into 1,920 training and 480 validation articles.

### Checks Applied at Each Stage
- **Cleaning**: Content length, Myanmar script validation
- **Preprocessing**: Sentence structure, text normalization
- **Tokenization**: Token count thresholds, vocabulary analysis
- **Labeling**: Balanced dataset creation, train/val splits

### Output Format for Training
Final CSV contains:
- `full_text`: Complete article (title + content)
- `tokens`: Space-separated Myanmar tokens
- `label_numeric`: 0=neutral, 1=red, 2=green
- `token_count`: Number of tokens (for model input sizing)
- Source and metadata for analysis
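
Reading the final CSV back into typed records could look like the following sketch. It relies only on the four column names listed above. The loader name is hypothetical, and it takes a file object because the output path is not stated in this document.

```python
import csv
import io


def load_rows(fileobj):
    """Parse the training CSV into dicts with typed fields."""
    rows = []
    for rec in csv.DictReader(fileobj):
        rows.append({
            "full_text": rec["full_text"],
            "tokens": rec["tokens"].split(),  # space-separated tokens
            "label": int(rec["label_numeric"]),
            "token_count": int(rec["token_count"]),
        })
    return rows


# Usage with an in-memory sample instead of a real file:
sample = io.StringIO(
    "full_text,tokens,label_numeric,token_count\n"
    "abc def,abc def,2,2\n"
)
rows = load_rows(sample)
```

Checking that `len(row["tokens"]) == row["token_count"]` for each row is a cheap consistency test before training.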

This structured approach produces clean, balanced training data for our BiLSTM Myanmar news classification model.