# Myanmar News Scraping Pipeline Walkthrough

This notebook explains the web scraping component of our Myanmar news classification system. We collect articles from three news sources with different political leanings to create a balanced dataset.

## Environment Setup
**Conda Environment:** nlp  
**Purpose:** Collect Myanmar news articles for sentiment classification training

## Overview
Our scraping system targets three Myanmar news websites:
- **DVB News** (Democratic Voice of Burma) - Neutral/Opposition perspective
- **Myawady News** - Government-aligned perspective  
- **Khitthit News** - Independent/Critical perspective

This creates a diverse dataset representing different political viewpoints in Myanmar media.

## 1. Core Scraping Architecture

### Base Scraper Class Design
Our scraping system uses a modular approach with a base scraper class that each news source extends:

In [None]:
import requests
from bs4 import BeautifulSoup
import json
import time
from datetime import datetime
import re

class BaseScraper:
    """
    Base scraper class providing common functionality for all news sources.
    
    Key Design Principles:
    - Respectful scraping with delays
    - Error handling for network issues
    - Unicode text cleaning for Myanmar content
    - Structured data output in JSON format
    """
    
    def __init__(self, base_url, delay=1.0):
        """
        Initialize scraper with base configuration.
        
        Args:
            base_url (str): Base URL of the news website
            delay (float): Delay between requests in seconds
        """
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; NewsResearchBot/1.0)'
        })
    
    def clean_myanmar_text(self, text):
        """
        Clean Myanmar text content.
        
        Why this is needed:
        - Myanmar websites often have encoding issues
        - Mixed Unicode ranges (Myanmar + English + symbols)
        - Extra whitespace and formatting artifacts
        
        Returns:
            str: Cleaned text suitable for NLP processing
        """
        if not text:
            return ""
        
        # Replace characters outside the main Myanmar block and printable ASCII
        # Main Myanmar Unicode block: U+1000–U+109F (extended blocks are not covered)
        text = re.sub(r'[^\u1000-\u109F\u0020-\u007E\s]', ' ', text)
        
        # Collapse whitespace last so the replacements above leave no double spaces
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
        
        return text
    
    def scrape_article_list(self, max_articles=100):
        """
        Get list of article URLs from the main page.
        Each subclass implements this based on site structure.
        """
        raise NotImplementedError("Subclasses must implement this method")
    
    def scrape_article_content(self, article_url):
        """
        Extract content from individual article page.
        Each subclass implements this based on site structure.
        """
        raise NotImplementedError("Subclasses must implement this method")

print("✅ Base scraper architecture defined")
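
The design principles above list error handling for network issues, but the base class as shown has no retry step. The following is a minimal sketch of one way to add it, assuming a hypothetical helper named `get_with_retries` that is not part of the original code. It catches `OSError`, which `requests.RequestException` subclasses, so the sketch itself only needs the standard library.

```python
import time


def get_with_retries(session, url, retries=3, backoff=1.0, timeout=30, sleep=time.sleep):
    """Hypothetical helper: GET a URL, retrying on network or HTTP errors.

    `session` is anything with a requests-style `get(url, timeout=...)`,
    for example the `requests.Session` held by BaseScraper.
    Returns the response, or None once all attempts have failed.
    """
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except OSError as exc:  # requests.RequestException subclasses OSError
            print(f"Attempt {attempt + 1}/{retries} failed for {url}: {exc}")
            if attempt < retries - 1:
                # Exponential backoff: 1x, 2x, 4x ... the base delay
                sleep(backoff * 2 ** attempt)
    return None
```

A subclass could call `get_with_retries(self.session, article_url)` in place of a bare `self.session.get(...)` and treat a `None` result as a failed article.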

## 2. DVB News Scraper Implementation

DVB News contributes the neutral/opposition perspective to the dataset. Key challenges:
- Dynamic content loading
- Mixed language content (Myanmar + English)
- Anti-bot measures requiring careful request handling

In [None]:
class DVBNewsScraper(BaseScraper):
    """
    DVB News scraper implementation.
    
    Website characteristics:
    - WordPress-based structure
    - Articles in both Myanmar and English
    - Pagination-based article listing
    """
    
    def __init__(self):
        super().__init__('https://www.dvb.no/', delay=2.0)  # Slower to be respectful
        self.article_selectors = {
            'title': 'h1.entry-title',
            'content': 'div.entry-content p',
            'date': 'time.entry-date'
        }
    
    def extract_article_links(self, soup):
        """
        Extract article links from DVB main page.
        
        Strategy:
        - Target article title links
        - Filter Myanmar language articles
        - Exclude non-news content (ads, categories)
        
        Returns:
            list: URLs of Myanmar articles
        """
        article_links = []
        
        # Find all article links
        links = soup.find_all('a', href=True)
        
        for link in links:
            href = link.get('href')
            title = link.get_text(strip=True)
            
            # Check if this looks like a Myanmar article
            if self.is_myanmar_article(href, title):
                # Join relative links without doubling or dropping the slash
                full_url = href if href.startswith('http') else self.base_url.rstrip('/') + '/' + href.lstrip('/')
                article_links.append(full_url)
        
        return list(dict.fromkeys(article_links))  # Remove duplicates, keep page order
    
    def is_myanmar_article(self, url, title):
        """
        Determine if article is Myanmar content.
        
        Logic:
        - Check for Myanmar Unicode characters in title
        - Exclude category/tag pages
        - Exclude admin/wp-content URLs
        """
        # Skip non-article URLs
        skip_patterns = ['/category/', '/tag/', '/wp-', '/admin/', '/?']
        if any(pattern in url for pattern in skip_patterns):
            return False
        
        # Check for Myanmar script in title
        myanmar_pattern = r'[\u1000-\u109F]'
        return bool(re.search(myanmar_pattern, title))
    
    def scrape_article_list(self, max_articles=100):
        """
        Get article URLs from the DVB front page.
        
        Minimal implementation: only the front page is fetched;
        pagination is not followed.
        """
        try:
            response = self.session.get(self.base_url, timeout=30)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Error fetching article list from {self.base_url}: {e}")
            return []
        
        soup = BeautifulSoup(response.content, 'html.parser')
        return self.extract_article_links(soup)[:max_articles]
    
    def scrape_article_content(self, article_url):
        """
        Extract full content from DVB article page.
        
        Returns:
            dict: Article data with title, content, metadata
        """
        try:
            response = self.session.get(article_url, timeout=30)
            response.raise_for_status()  # Treat HTTP error statuses as failures
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Extract article components
            title = self.extract_title(soup)
            content = self.extract_content(soup)
            
            # Basic validation
            if not title or len(content) < 50:
                return None
            
            return {
                'url': article_url,
                'title': self.clean_myanmar_text(title),
                'content': self.clean_myanmar_text(content),
                'scraped_at': datetime.now().isoformat(),
                'source': 'dvb'
            }
            
        except Exception as e:
            print(f"Error scraping {article_url}: {e}")
            return None
    
    def extract_title(self, soup):
        """Extract article title using CSS selectors."""
        title_elem = soup.select_one(self.article_selectors['title'])
        return title_elem.get_text(strip=True) if title_elem else ""
    
    def extract_content(self, soup):
        """Extract article content paragraphs."""
        content_elems = soup.select(self.article_selectors['content'])
        paragraphs = [elem.get_text(strip=True) for elem in content_elems]
        return ' '.join(paragraphs)

print("✅ DVB scraper implementation complete")
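
Link extraction has to cope with absolute links, root-relative links, and page-relative links. As a sketch of an alternative to string concatenation, the standard library's `urllib.parse.urljoin` resolves all three forms. The helper name `resolve_article_links` and the `/post/...` paths below are made up for illustration and do not come from the DVB site.

```python
from urllib.parse import urljoin


def resolve_article_links(base_url, hrefs):
    """Hypothetical helper: make hrefs absolute and drop duplicates, keeping order."""
    seen = set()
    resolved = []
    for href in hrefs:
        # urljoin handles '/path', 'path' and already-absolute URLs
        full_url = urljoin(base_url, href)
        if full_url not in seen:
            seen.add(full_url)
            resolved.append(full_url)
    return resolved


links = resolve_article_links(
    'https://www.dvb.no/',
    ['/post/1', 'post/1', 'https://www.dvb.no/post/2'],  # placeholder paths
)
print(links)
```

Both spellings of the first link resolve to the same URL, so only two entries remain.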

## 3. Data Collection Strategy

### Why Three Sources?
Myanmar's media landscape requires diverse data collection:

1. **DVB** - Opposition/exile media perspective
2. **Myawady** - State-controlled media perspective
3. **Khitthit** - Independent critical journalism

Sampling all three is intended to expose the model to the full political spectrum, so that it learns more than the stylistic quirks of a single outlet.

In [None]:
def collect_balanced_dataset(articles_per_source=100):
    """
    Collect balanced dataset from all three sources.
    
    Strategy:
    - Equal articles from each source
    - Quality filtering (length, language detection)
    - Metadata preservation for later labeling
    
    Args:
        articles_per_source (int): Number of articles to collect per source
    
    Returns:
        dict: Combined dataset with metadata
    """
    
    # Initialize scrapers for all sources
    scrapers = {
        'dvb': DVBNewsScraper(),
        'myawady': MyAwadyNewsScraper(),  # Defined separately (not shown here); similar to DVBNewsScraper
        'khitthit': KhitthitNewsScraper()  # Defined separately (not shown here); similar to DVBNewsScraper
    }
    
    collected_data = {
        'articles': [],
        'metadata': {
            'collection_date': datetime.now().isoformat(),
            'sources': list(scrapers.keys()),
            'target_per_source': articles_per_source
        },
        'stats': {}
    }
    
    for source_name, scraper in scrapers.items():
        print(f"\n🔍 Collecting from {source_name.upper()}...")
        
        # Get article URLs
        article_urls = scraper.scrape_article_list(max_articles=articles_per_source * 2)
        print(f"   Found {len(article_urls)} potential articles")
        
        # Scrape article content
        articles_collected = 0
        for url in article_urls[:articles_per_source * 2]:
            if articles_collected >= articles_per_source:
                break
                
            article_data = scraper.scrape_article_content(url)
            
            if article_data and validate_article_quality(article_data):
                collected_data['articles'].append(article_data)
                articles_collected += 1
                
                if articles_collected % 10 == 0:
                    print(f"   Collected {articles_collected}/{articles_per_source}")
            
            # Respectful delay
            time.sleep(scraper.delay)
        
        collected_data['stats'][source_name] = articles_collected
        print(f"   ✅ Completed: {articles_collected} articles from {source_name}")
    
    return collected_data

def validate_article_quality(article_data):
    """
    Quality validation for scraped articles.
    
    Criteria:
    - Minimum content and title length (meaningful articles)
    - Myanmar script presence (language filtering)
    - Minimum number of whitespace-separated segments
    
    Duplicate content is not checked here.
    
    Args:
        article_data (dict): Article with title, content, etc.
    
    Returns:
        bool: True if article meets quality standards
    """
    content = article_data.get('content', '')
    title = article_data.get('title', '')
    
    # Length requirements
    if len(content) < 100:  # Minimum meaningful content
        return False
    
    if len(title) < 5:  # Minimum title length
        return False
    
    # Myanmar script requirement
    myanmar_pattern = r'[\u1000-\u109F]'
    if not re.search(myanmar_pattern, content + title):
        return False
    
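    # Segment count check (avoid pages with mostly navigation/ads).
    # Myanmar text is not consistently space-delimited, so this counts
    # whitespace-separated segments rather than true words.
    words = content.split()
    if len(words) < 20:  # Minimum segment count
        return False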
    
    return True

print("✅ Data collection strategy defined")
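
The quality checks above do not compare articles against each other. A minimal sketch of content-level duplicate removal, assuming hypothetical helpers `content_fingerprint` and `drop_duplicate_articles` (neither is part of the original code), hashes each article's whitespace-normalized content and keeps the first article per hash:

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash of the text with whitespace collapsed, so spacing differences still match."""
    normalized = re.sub(r'\s+', ' ', text).strip()
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def drop_duplicate_articles(articles):
    """Hypothetical helper: keep the first article for each distinct content fingerprint."""
    seen = set()
    unique = []
    for article in articles:
        fingerprint = content_fingerprint(article.get('content', ''))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(article)
    return unique
```

This only catches exact matches after whitespace normalization. Near-duplicates, such as the same wire story with a different headline, would need a fuzzier comparison.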

## 4. Output Format and Structure

Our scraping system outputs structured JSON data that feeds into the next pipeline stage.

In [None]:
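# Example output structure (illustrative: word_count, language_detected,
# total_articles and collection_stats are not produced by the code shown above)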
sample_scraped_data = {
    "articles": [
        {
            "url": "https://www.dvb.no/article/example-url",
            "title": "မြန်မာ နိုင်ငံတော် ရေးရာ ခေါင်းစဉ်",  # Myanmar title
            "content": "မြန်မာ ဘာသာ ဖြင့် ရေးသား ထား သော ဆောင်းပါး အကြောင်းအရာ...",  # Myanmar content
            "scraped_at": "2025-08-23T10:30:00",
            "source": "dvb",
            "word_count": 245,
            "language_detected": "my"
        }
    ],
    "metadata": {
        "collection_date": "2025-08-23T10:30:00",
        "total_articles": 300,
        "sources": ["dvb", "myawady", "khitthit"],
        "collection_stats": {
            "dvb": {"attempted": 150, "successful": 100, "success_rate": 0.67},
            "myawady": {"attempted": 120, "successful": 100, "success_rate": 0.83},
            "khitthit": {"attempted": 130, "successful": 100, "success_rate": 0.77}
        }
    }
}

def save_scraped_data(data, output_dir):
    """
    Save scraped data in multiple formats for pipeline consumption.
    
    Outputs:
    1. Raw JSON - Complete data for debugging
    2. Training text - Clean text for model training
    3. Statistics - Collection metrics
    
    Args:
        data (dict): Scraped article data
        output_dir (str): Directory to save files
    """
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    # 1. Save complete raw data
    raw_file = f"{output_dir}/raw_articles_{timestamp}.json"
    with open(raw_file, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    
    # 2. Save training-ready text
    training_file = f"{output_dir}/training_text_{timestamp}.txt"
    with open(training_file, 'w', encoding='utf-8') as f:
        for article in data['articles']:
            # Format: title + content for each article
            text = f"{article['title']} {article['content']}"
            f.write(text + '\n')
    
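    # 3. Save collection statistics (metadata plus per-source counts, if present)
    stats_file = f"{output_dir}/stats_{timestamp}.json"
    stats = {'metadata': data['metadata'], 'stats': data.get('stats', {})}
    with open(stats_file, 'w', encoding='utf-8') as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)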
    
    print("✅ Data saved:")
    print(f"   Raw: {raw_file}")
    print(f"   Training: {training_file}")
    print(f"   Stats: {stats_file}")

print("✅ Output format structure defined")
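
For the next pipeline stage, the raw JSON file has to be read back with the same UTF-8 encoding it was written with. A small sketch, assuming hypothetical helpers `load_raw_articles` and `count_by_source` that are not part of the original code:

```python
import json
from collections import Counter


def load_raw_articles(raw_file):
    """Hypothetical helper: load a raw_articles_*.json file written by save_scraped_data."""
    with open(raw_file, encoding='utf-8') as f:
        data = json.load(f)
    return data['articles']


def count_by_source(articles):
    """Count articles per 'source' value, e.g. to confirm the dataset is balanced."""
    return Counter(article['source'] for article in articles)
```

Because the file is written with `ensure_ascii=False`, Myanmar text is stored as readable UTF-8 rather than `\uXXXX` escapes, and it loads back unchanged.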

## 5. Key Implementation Decisions

### Why These Choices?

1. **Respectful Scraping:**
   - 1-2 second delays between requests
   - Proper User-Agent headers
   - Session management for efficiency
   - Error handling to avoid crashes

2. **Myanmar Text Handling:**
   - Whitespace cleanup and character filtering (Unicode normalization is left to the cleaning stage)
   - Myanmar script detection (U+1000–U+109F)
   - Mixed language support (Myanmar + English)
   - UTF-8 output that preserves Myanmar script for NLP

3. **Quality Control:**
   - Minimum content length filtering
   - Myanmar script validation
   - Duplicate URL removal
   - Metadata preservation for debugging

4. **Scalable Architecture:**
   - Base class for common functionality
   - Source-specific implementations
   - Configurable parameters
   - Error recovery mechanisms
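
The respectful-scraping points above could be extended with a robots.txt check. The sketch below uses the standard library's `urllib.robotparser`. The rules are made up for illustration and are not the robots.txt of any real site, and the `NewsResearchBot` token is an assumption taken from the User-Agent string used earlier.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = 'NewsResearchBot'  # assumed token, based on the User-Agent header above


def build_robots_checker(robots_lines):
    """Parse robots.txt content that has already been downloaded and split into lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# Made-up rules for illustration only
example_rules = [
    'User-agent: *',
    'Disallow: /wp-admin/',
    'Crawl-delay: 2',
]
robots = build_robots_checker(example_rules)

print(robots.can_fetch(USER_AGENT, 'https://example.com/news/article-1'))
print(robots.can_fetch(USER_AGENT, 'https://example.com/wp-admin/options.php'))
print(robots.crawl_delay(USER_AGENT))
```

Against a live site, `set_url(...)` followed by `read()` fetches the file directly, and a non-`None` `crawl_delay` value could replace the scraper's fixed `delay`.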

## 6. Integration with Pipeline

The scraping stage feeds into our data processing pipeline:

```
Scraping → Cleaning → Preprocessing → Tokenization → Labeling → Training
```

**Output Files:**
- Raw JSON files are moved to `data/raw/to_process/`
- Training text files are used for manual review
- Statistics help monitor collection quality
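
The hand-off of raw files can be sketched as follows. This assumes a hypothetical helper `move_raw_files` (not part of the original code) and the `raw_articles_*.json` naming used by `save_scraped_data`.

```python
import shutil
from pathlib import Path


def move_raw_files(output_dir, dest_dir='data/raw/to_process'):
    """Hypothetical helper: move raw_articles_*.json files into the processing queue."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for raw_file in sorted(Path(output_dir).glob('raw_articles_*.json')):
        target = dest / raw_file.name
        shutil.move(str(raw_file), str(target))
        moved.append(target)
    return moved
```

Training text and statistics files are left in place, since only the raw JSON feeds the next stage.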

**Next Stage:** Data cleaning removes HTML artifacts, normalizes Unicode, and prepares text for NLP processing.
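
As a rough illustration of what that cleaning step might involve, the sketch below strips leftover tags, decodes HTML entities, and applies Unicode normalization. The function name `basic_clean` and the choice of NFC are assumptions; the actual cleaning stage is not shown in this notebook.

```python
import html
import re
import unicodedata


def basic_clean(text):
    """Hypothetical cleaning step: strip tags, decode entities, normalize Unicode."""
    text = re.sub(r'<[^>]+>', ' ', text)       # crude removal of leftover HTML tags
    text = html.unescape(text)                 # '&amp;' -> '&'
    text = unicodedata.normalize('NFC', text)  # canonical composition (assumed form)
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
```

Tags are removed before entities are decoded so that escaped markup such as `&lt;b&gt;` survives as literal text instead of being stripped.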