A fast, modern Python library for classifying news articles as hard news vs. soft news using multiple approaches: URL patterns, machine learning models, and Large Language Models.
🚀 Three Classification Methods:
- URL Pattern Analysis - Lightning-fast classification using URL structure
- ML Models - Trained scikit-learn models for US/UK news prediction
- LLM Classification - Flexible categorization using Claude or OpenAI
🌍 Multi-Region Support:
- US and UK news patterns and models
- Easily extensible to other regions
⚡ Modern Architecture:
- Unified API with consistent interface
- Click-based CLI for command-line usage
- Built with uv_build for 10-35x faster builds
- Type hints and comprehensive error handling
Streamlit Demo: https://notnews-notnews-streamlitstreamlit-app-u8j3a6.streamlit.app/
import pandas as pd
import notnews
# Load your data
df = pd.read_csv("news_articles.csv")
# Method 1: URL Pattern Classification (fastest)
df_url = notnews.classify_by_url(df, url_col="url", region="us")
print(df_url[["url", "hard_news", "soft_news"]].head())
# Method 2: ML Model Prediction (most accurate)
df_ml = notnews.predict_soft_news(df, text_col="text", region="us")
print(df_ml[["text", "prob_soft_news_us"]].head())
# Method 3: LLM Classification (most flexible)
# Requires ANTHROPIC_API_KEY or OPENAI_API_KEY environment variable
df_llm = notnews.classify_with_llm(df, text_col="text", provider="claude")
print(df_llm[["text", "llm_category", "llm_confidence"]].head())
# Detailed Categories (US only)
df_categories = notnews.predict_news_category(df, text_col="text")
print(df_categories[["text", "pred_category", "prob_soft_news"]].head())
# Install the package
pip install notnews
# or with uv
uv add notnews
# URL pattern classification
notnews classify-urls articles.csv --region us --output results.csv
# ML model prediction
notnews predict-ml articles.csv --region uk --text-col content
# LLM classification
notnews classify-llm articles.csv --provider claude --api-key your_key
# Run all methods together
notnews classify-all articles.csv --region us
# Get help
notnews --help
notnews classify-urls --help
pip install notnews
uv add notnews
- Python: 3.11, 3.12, or 3.13
- Core: pandas, numpy, scikit-learn 1.3+, nltk
- Web: requests, beautifulsoup4
- CLI: click 8.0+
- Optional: anthropic, openai (for LLM classification)
For LLM classification, set your API key:
# For Claude
export ANTHROPIC_API_KEY="your_key_here"
# For OpenAI
export OPENAI_API_KEY="your_key_here"
classify_by_url: Classify articles using URL pattern matching.
Args:
- df: DataFrame with articles
- url_col: Column containing URLs
- region: "us" or "uk" for region-specific patterns
Returns: DataFrame with hard_news and soft_news columns
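To illustrate the idea behind URL pattern matching, here is a minimal standalone sketch. It is not the library's implementation: the function name `classify_url`, the regular expressions, and the choice to treat "uk-news" and "scottish-news" as hard news are all assumptions made for illustration. Only the topic groupings and the two UK path segments come from this README.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for illustration only; the real pattern lists may differ.
HARD_PATTERNS = {
    "us": re.compile(r"politic|econom|world|international", re.IGNORECASE),
    # Assumption: UK-specific segments such as "uk-news" count as hard news.
    "uk": re.compile(
        r"politic|econom|world|international|uk-news|scottish-news",
        re.IGNORECASE,
    ),
}
SOFT_PATTERN = re.compile(r"sport|entertainment|lifestyle", re.IGNORECASE)


def classify_url(url: str, region: str = "us") -> dict:
    """Return hard_news / soft_news flags based on the URL path alone."""
    if region not in HARD_PATTERNS:
        raise ValueError(f"unsupported region: {region!r}")
    path = urlparse(url).path
    return {
        "url": url,
        "hard_news": bool(HARD_PATTERNS[region].search(path)),
        "soft_news": bool(SOFT_PATTERN.search(path)),
    }


print(classify_url("https://example.com/politics/election-results"))
print(classify_url("https://example.co.uk/uk-news/story", region="uk"))
```

Because only the URL path is inspected, no article text needs to be downloaded or parsed, which is why this approach can be much faster than the other two methods.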
predict_soft_news: Predict soft news probability using trained ML models.
Args:
- df: DataFrame with articles
- text_col: Column containing article text
- region: "us" or "uk" for model selection
Returns: DataFrame with prob_soft_news_{region} column
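The probability column can be turned into a discrete label by thresholding. The sketch below works on plain dictionaries so that it runs without any dependencies. The helper name `label_from_probability` and the 0.5 cutoff are assumptions, not part of notnews; pick a threshold that suits your data.

```python
def label_from_probability(rows, region="us", threshold=0.5):
    """Add a 'label' key to each row based on prob_soft_news_{region}.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    column = f"prob_soft_news_{region}"
    labeled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["label"] = "soft" if row[column] >= threshold else "hard"
        labeled.append(new_row)
    return labeled


rows = [
    {"text": "Celebrity chef opens new restaurant", "prob_soft_news_us": 0.91},
    {"text": "Senate passes budget bill", "prob_soft_news_us": 0.08},
]
for row in label_from_probability(rows):
    print(row["label"], "-", row["text"])
```

With a pandas DataFrame, the same rows can be obtained via `df.to_dict("records")`.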
classify_with_llm: Classify articles using Large Language Models.
Args:
- df: DataFrame with articles
- text_col: Column containing article text
- provider: "claude" or "openai"
- categories: Optional custom categories dict
- api_key: Optional API key (uses env var if not provided)
Returns: DataFrame with llm_category, llm_confidence, llm_reasoning columns
# Custom LLM categories
custom_categories = {
"breaking": {"description": "Breaking news and urgent updates"},
"analysis": {"description": "In-depth analysis and commentary"},
"lifestyle": {"description": "Lifestyle and entertainment content"}
}
df_custom = notnews.classify_with_llm(
df,
provider="claude",
categories=custom_categories
)
# Fetch content from URLs
content = notnews.fetch_web_content("https://example.com/article")
- US: Politics, economics, international affairs vs. sports, entertainment, lifestyle
- UK: Includes UK-specific patterns like "uk-news", "scottish-news"
- US: NYT-based models trained on headline and content text
- UK: URL-based model trained on UK news outlets
- Compatible with scikit-learn 1.3-1.5 (models trained on 0.22+)
- URL Classification: ~1000 articles/second
- ML Prediction: ~100 articles/second
- LLM Classification: ~1-10 articles/second (API dependent)
- US Model: Based on NYT data
- UK Model: Based on UK news analysis
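To get a feel for what the approximate throughput figures listed above mean in practice, the following back-of-the-envelope sketch estimates the runtime for a batch of articles. The rates are the rough numbers from this README; the helper name and the 50,000-article batch size are made up for illustration, and real throughput will depend on hardware, text length, and API limits.

```python
# Approximate rates in articles per second, as (slowest, fastest).
RATES = {
    "url": (1000, 1000),
    "ml": (100, 100),
    "llm": (1, 10),  # API dependent, so a range
}


def estimated_seconds(n_articles: int, method: str) -> tuple[float, float]:
    """Return (best_case, worst_case) runtime in seconds."""
    slow, fast = RATES[method]
    return n_articles / fast, n_articles / slow


for method in RATES:
    best, worst = estimated_seconds(50_000, method)
    print(f"{method}: {best:.0f}-{worst:.0f} s")
```

For 50,000 articles this gives about 50 seconds for URL patterns, about 500 seconds for the ML models, and somewhere between 5,000 and 50,000 seconds (roughly 1.4 to 14 hours) for LLM classification.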
Research using notnews:
Full documentation: notnews.readthedocs.io
We welcome contributions! Please see our Contributor Code of Conduct.
git clone https://github.com/notnews/notnews.git
cd notnews
uv sync --dev
uv run pytest
- Suriyan Laohaprapanon
- Gaurav Sood