# News Watch: Indonesia's News Websites Scraper

This notebook demonstrates how to use the **news-watch** package to scrape Indonesian news articles. The package is a command-line tool; its arguments control the keywords, start date, news sources, output format, and logging.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/okkymabruri/news-watch/blob/main/notebook/run-newswatch-on-colab.ipynb)

## 📋 Overview

The **news-watch** package enables you to:
- 🔍 Scrape articles from 14+ major Indonesian news websites
- 🎯 Filter articles based on keywords
- 📅 Specify a start date for scraping
- 📊 Choose output formats (CSV or XLSX)
- ⚙️ Control logging verbosity
- 🌐 Select specific news sources or use all available

**Note:** This notebook is optimized for Google Colab. Output files (CSV/XLSX) will appear in the Files panel on the left.

## 🌐 Supported News Websites

- [Antaranews.com](https://www.antaranews.com/)
- [Bisnis.com](https://www.bisnis.com/)
- [Bloomberg Technoz](https://www.bloombergtechnoz.com/)
- [CNBC Indonesia](https://www.cnbcindonesia.com/)
- [Detik.com](https://www.detik.com/)
- [Jawapos.com](https://www.jawapos.com/)
- [Katadata.co.id](https://katadata.co.id/)
- [Kompas.com](https://www.kompas.com/)
- [Kontan.co.id](https://www.kontan.co.id/)
- [Media Indonesia](https://mediaindonesia.com/)
- [Metrotvnews.com](https://metrotvnews.com/)
- [Okezone.com](https://www.okezone.com/)
- [Tempo.co](https://www.tempo.co/)
- [Viva.co.id](https://www.viva.co.id/)

**Platform Notes:**
- Some scrapers (Kontan.co.id, Jawapos.com, Bisnis.com) are automatically excluded on Linux due to compatibility issues. Google Colab runs on Linux, so this applies when you run this notebook there
- Use `-s all` to force all scrapers (may cause errors on some platforms)
- Kontan.co.id scraper has a maximum limit of 50 pages
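
The platform rule above can be pictured with a small sketch. This is not the package's own code: the function `auto_select` and the scraper names `kontan`, `jawapos`, and `bisnis` are assumptions made for illustration, and the real `auto` mode may be implemented differently.

```python
import platform

# Assumed scraper names for the three sites the note above lists as
# excluded on Linux. Hypothetical; check `newswatch --list_scrapers`.
LINUX_EXCLUDED = {"kontan", "jawapos", "bisnis"}


def auto_select(scrapers, system=None):
    """Return the scrapers that an 'auto'-style rule would keep."""
    system = system or platform.system()  # "Linux", "Darwin", "Windows", ...
    if system == "Linux":
        return [name for name in scrapers if name not in LINUX_EXCLUDED]
    return list(scrapers)


requested = ["kompas", "detik", "kontan", "bisnis"]
print(auto_select(requested, system="Linux"))   # ['kompas', 'detik']
print(auto_select(requested, system="Darwin"))  # all four names kept
```

Passing `system` explicitly makes it easy to see what would be dropped on Colab without leaving your own machine.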

## 🛠️ Installation

### Step 1: Install the news-watch package

In [None]:
!pip install news-watch --upgrade

### Alternative: Install development version

In [None]:
# !pip install git+https://github.com/okkymabruri/news-watch.git@dev

### Step 2: Install Playwright browsers (Required)

In [None]:
!playwright install chromium

## 📖 Command Line Arguments

### Required Arguments:
- `-k` or `--keywords`: Comma-separated list of keywords (e.g., `"ihsg,bank,keuangan"`)
- `-sd` or `--start_date`: Start date in YYYY-MM-DD format (e.g., `2025-01-01`)
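
If you want a rolling window rather than a fixed date, the `YYYY-MM-DD` string can be computed in Python. The helper name `start_date_days_ago` is hypothetical and not part of news-watch; it only formats a date for the `-sd` argument.

```python
from datetime import date, timedelta


def start_date_days_ago(days, today=None):
    """Return the date `days` before `today` as a YYYY-MM-DD string."""
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()


# A fixed reference date keeps the example reproducible.
print(start_date_days_ago(7, today=date(2025, 1, 8)))  # 2025-01-01
```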

### Optional Arguments:
- `-s` or `--scrapers`: Specific scrapers to use (default: 'auto')
  - `auto`: Platform-appropriate scrapers (recommended)
  - `all`: Force all scrapers (may fail on some platforms)
  - Specific names: e.g., `"kompas,detik,cnbcindonesia"`
- `-of` or `--output_format`: Output format (`csv` or `xlsx`, default: csv)
- `-v` or `--verbose`: Show all logging output (silent by default)
- `--list_scrapers`: List all supported scrapers

### 💡 Shell Commands in Notebooks
The `!` prefix runs a shell command from a notebook cell. Because news-watch is a command-line tool, every `newswatch` command in this notebook starts with `!`.
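
When the keywords or sources live in Python variables, it can be convenient to assemble the command in Python first. The sketch below is one possible approach, not part of news-watch: `build_newswatch_command` is a hypothetical helper that uses only the flags documented above.

```python
import shlex


def build_newswatch_command(keywords, start_date, scrapers=None,
                            output_format=None, verbose=False):
    """Build the argument list for a newswatch call from Python values."""
    cmd = ["newswatch",
           "--keywords", ",".join(keywords),
           "--start_date", start_date]
    if scrapers:
        cmd += ["--scrapers", ",".join(scrapers)]
    if output_format:
        cmd += ["--output_format", output_format]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_newswatch_command(
    ["ihsg", "bank"], "2025-01-01",
    scrapers=["kompas", "detik"], output_format="xlsx", verbose=True,
)
# shlex.join quotes arguments where needed (e.g. keywords with spaces).
print(shlex.join(cmd))

# To actually run it (requires news-watch to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with multi-word keywords such as `pasar modal`.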

## 🚀 Getting Started

### Display help information

In [None]:
!newswatch --help

### List available scrapers

In [None]:
!newswatch --list_scrapers

## 📝 Basic Examples

### Example 1: Simple keyword search

In [None]:
!newswatch --keywords ihsg --start_date 2025-01-01

### Example 2: Multiple keywords with verbose output

In [None]:
!newswatch -k "ihsg,bank,keuangan" -sd 2025-01-01 -v

### Example 3: Economic keywords (Indonesian terms)

In [None]:
!newswatch -k "pasar modal,kebijakan,suku bunga" -sd 2025-01-01 -v

## 🎯 Advanced Examples

### Example 4: Specific news sources with Excel output

In [None]:
!newswatch -k "presiden" -s "antaranews,bisnis,detik,cnbcindonesia" --output_format xlsx -sd 2025-01-01

### Example 5: High-quality sources for financial news

In [None]:
!newswatch -k "investasi,saham" -s "kontan,bisnis,cnbcindonesia" -sd 2025-01-01 --output_format xlsx -v

### Example 6: Political news from major sources

In [None]:
!newswatch -k "pemilu,politik" -s "kompas,tempo,detik" -sd 2025-01-01 -v

### Example 7: Force all scrapers (use with caution)

In [None]:
# !newswatch -k "ekonomi" -sd 2025-01-01 -s "all" -v

## 📊 Output Format

Scraped articles are saved to a file named `news-watch-{keywords}-YYYYMMDD_HH`, with a `.csv` or `.xlsx` extension depending on the chosen output format.

### Output columns include:
- `title`: Article headline
- `publish_date`: Publication date
- `author`: Article author
- `content`: Full article content
- `keyword`: Matched keyword
- `category`: News category
- `source`: News website source
- `link`: Original article URL

### File formats:
- **CSV**: Default format, plain text that most data tools can read
- **XLSX**: Excel format, opens directly in spreadsheet software
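
Once a CSV file has been written, it can be inspected with the Python standard library. The sketch below counts articles per source using the `source` column listed above. It assumes the file is UTF-8 encoded and sits in the current working directory; `count_by_source` is a hypothetical helper, not part of news-watch.

```python
import csv
import glob
from collections import Counter


def count_by_source(csv_path):
    """Count rows per value of the `source` column in one output file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["source"] for row in csv.DictReader(handle))


# Output files follow the news-watch-{keywords}-YYYYMMDD_HH naming pattern.
for path in sorted(glob.glob("news-watch-*.csv")):
    print(path, dict(count_by_source(path)))
```

If no matching file exists yet, the loop simply prints nothing.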

## ⚠️ Important Notes

### Ethical Usage:
- This tool is for **educational and research purposes only**
- Users must comply with each website's Terms of Service and robots.txt
- Avoid aggressive scraping that could overload servers
- Respect rate limits and website policies

### Technical Limitations:
- With the default `-s auto`, the Kontan.co.id, Jawapos.com, and Bisnis.com scrapers are excluded on Linux for compatibility reasons
- Kontan.co.id has a 50-page limit
- Network issues may affect scraping success
- Some websites may implement anti-scraping measures

### Troubleshooting:
- If scraping fails, rerun with a smaller set of scrapers via `-s` (e.g., `-s "kompas,detik"`) to isolate the failing source
- Use `-v` flag to see detailed logging for debugging
- Check internet connection and website availability
- Some websites may be temporarily unavailable


---
*Happy scraping! 🚀*