# NYT Front Page News Collection for Financial Analysis

This notebook collects New York Times front page news headlines and content from 2004-2025 for comprehensive financial market analysis.

## Features
- **Complete Coverage**: Pages 1-10 for full front page story coverage
- **Rich Content**: Headlines, lead paragraphs, snippets, and metadata
- **Sectioned Organization**: News organized by themes (Politics, Economics, Health, etc.)
- **JSON Output**: Organized data files ready for analysis
- **Historical Range**: 2004-2025 (22 years of front page news)

## Quick Start
1. Set your API key in the configuration cell
2. Run the comprehensive collection function
3. Data will be saved as organized JSON files

In [19]:
pip install requests pandas

Note: you may need to restart the kernel to use updated packages.


In [117]:
# IMPORTS AND LIBRARIES
import requests
import time
import os
import json
import pandas as pd
from datetime import datetime, timedelta
from tqdm import tqdm
from dateutil import parser

In [119]:
# CONFIGURATION
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Config:
    """Configuration settings for NYT front page news collection."""
    api_key: str = "4BoKNTP1zEPMKk9BG8j4eqLn94pZV63Z"
    output_dir: str = "historical_front_page_news"
    max_pages: int = 10
    rate_limit_delay: float = 1.0
    base_url: str = "https://api.nytimes.com/svc/archive/v1"

# Initialize configuration
config = Config()

print("Configuration loaded!")
print(f"Output directory: {config.output_dir}")
print(f"API Key: {config.api_key[:10]}...{config.api_key[-5:]}")
print(f"Max pages: {config.max_pages}")

Configuration loaded!
Output directory: historical_front_page_news
API Key: 4BoKNTP1zE...ZV63Z
Max pages: 10


In [120]:
# CORE UTILITY FUNCTIONS
def ensure_directory_exists(directory_path: str) -> None:
    Path(directory_path).mkdir(parents=True, exist_ok=True)


def is_front_page_article(article: Dict, max_page: int = 1) -> bool:
    """Modified to focus on true front page (page 1) by default"""
    print_page = article.get("print_page")
    print_section = article.get("print_section", "")
    
    try:
        page_number = int(print_page) if print_page else 0
        
        # True front page: page 1 in section A (or no section specified)
        if page_number == 1 and (print_section == "A" or print_section == ""):
            return True
            
        # Allow flexibility with max_page if needed
        return 1 <= page_number <= max_page
    except (ValueError, TypeError):
        return False


def parse_publication_date(date_string: str) -> Optional[datetime]:
    try:
        return parser.parse(date_string) if date_string else None
    except Exception:
        return None

In [121]:
@dataclass
class ArticleSummary:
    """Structured representation of a news article."""
    headline: str
    snippet: str
    abstract: str
    lead_paragraph: str
    publication_date: str
    section_name: str
    print_page: str
    page_section: str
    web_url: str
    byline: str

def extract_article_data(raw_article: Dict) -> ArticleSummary:
    """Extract and structure article data from NYT API response."""
    return ArticleSummary(
        headline=raw_article.get("headline", {}).get("main", ""),
        snippet=raw_article.get("snippet", ""),
        abstract=raw_article.get("abstract", ""),
        lead_paragraph=raw_article.get("lead_paragraph", ""),
        publication_date=raw_article.get("pub_date", ""),
        section_name=raw_article.get("section_name", ""),
        print_page=raw_article.get("print_page", ""),
        page_section = raw_article.get("print_section", ""),
        web_url=raw_article.get("web_url", ""),
        byline=raw_article.get("byline", {}).get("original", ""),
    )


def process_articles_batch(raw_articles: List[Dict]) -> List[Dict]:
    processed_articles = []
    for raw_article in raw_articles:
        try:
            article_summary = extract_article_data(raw_article)
            # Convert dataclass to dictionary for JSON serialization
            processed_articles.append(article_summary.__dict__)
        except Exception as e:
            print(f"Error processing article: {e}")
            continue
    
    return processed_articles

In [122]:
def ensure_directory_exists(directory_path: str):
    Path(directory_path).mkdir(parents=True, exist_ok=True)

In [123]:
# ENHANCED FILE OPERATIONS WITH SECTION ORGANIZATION
def save_articles_to_json(articles: List[Dict], target_date: str, output_directory: str) -> bool:
    """
    Save articles to JSON file with only the standard (flat) structure.
    Output: { "metadata": { ... }, "articles": [ ... ] }
    """
    try:
        ensure_directory_exists(output_directory)
        data = {
            "metadata": {
                "date": target_date,
                "total_articles": len(articles)
            },
            "articles": articles
        }
        filename = f"{target_date}.json"
        filepath = Path(output_directory) / filename
        with open(filepath, 'w', encoding='utf-8') as file:
            json.dump(data, file, ensure_ascii=False, indent=2)
        print(f"Saved standard file: {filepath}")
        return True
    except Exception as e:
        print(f"Error saving to JSON: {e}")
        return False


###SECTION FILE STRUCTURE

In [None]:
"""
def organize_articles_by_section(articles: List[Dict]) -> Dict[str, List[Dict]]:
    sections = defaultdict(list)
    for article in articles:
        section = article.get("section_name", "Unknown") or "Unknown"
        sections[section].append(article)
    return dict(sections)

def save_articles_by_sections(articles: List[Dict], target_date: str, output_directory: str) -> bool:
    try:
        ensure_directory_exists(output_directory)
        organized_articles = organize_articles_by_section(articles)

        structured_data = {
            "metadata": {
                "date": target_date,
                "total_articles": len(articles)
            },
            "sections": organized_articles
        }

        filename = f"news_by_sections_{target_date}.json"
        filepath = Path(output_directory) / filename

        with open(filepath, 'w', encoding='utf-8') as file:
            json.dump(structured_data, file, ensure_ascii=False, indent=2)

        print(f"Saved {len(articles)} articles grouped by section to {filepath}")
        return True

    except Exception as e:
        print(f"Error saving sectioned JSON: {e}")
        return False
"""

'\ndef organize_articles_by_section(articles: List[Dict]) -> Dict[str, List[Dict]]:\n    sections = defaultdict(list)\n    for article in articles:\n        section = article.get("section_name", "Unknown") or "Unknown"\n        sections[section].append(article)\n    return dict(sections)\n\ndef save_articles_by_sections(articles: List[Dict], target_date: str, output_directory: str) -> bool:\n    try:\n        ensure_directory_exists(output_directory)\n        organized_articles = organize_articles_by_section(articles)\n\n        structured_data = {\n            "metadata": {\n                "date": target_date,\n                "total_articles": len(articles)\n            },\n            "sections": organized_articles\n        }\n\n        filename = f"news_by_sections_{target_date}.json"\n        filepath = Path(output_directory) / filename\n\n        with open(filepath, \'w\', encoding=\'utf-8\') as file:\n            json.dump(structured_data, file, ensure_ascii=False, indent=2)\n\

In [124]:
def save_complete_daily_news(articles: List[Dict], target_date: str, output_directory: str) -> bool:
    print(f"SAVING DAILY NEWS FOR {target_date}")
    print("Saving standard format...")
    "if u want save by sections, use save_articles_by_sections function"
    success = save_articles_to_json(articles, target_date, output_directory) 
    if success:
        print(f"SUCCESS! Standard file saved: {target_date}.json")
    else:
        print(f"FAILED! Could not save standard file.")
    return success

In [125]:
def fetch_month_articles(api_key: str, year: int, month: int) -> List[Dict]:
    """Fetch all articles for a specific month from the NYT Archive API."""
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    params = {"api-key": api_key}
    try:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        return data.get("response", {}).get("docs", [])
    except Exception as e:
        print(f"Error fetching articles for {year}-{month:02d}: {e}")
        return []

def filter_articles_by_date(articles: List[Dict], target_date: str) -> List[Dict]:
    """Filter articles published on the target date (YYYY-MM-DD)."""
    return [a for a in articles if a.get("pub_date", "").startswith(target_date)]

In [126]:
def collect_news_for_date(target_date, api_key, output_dir="news_data"):
    date_obj = datetime.strptime(target_date, "%Y-%m-%d")
    year, month = date_obj.year, date_obj.month

    all_articles = fetch_month_articles(api_key, year, month)
    front_page_articles = [a for a in all_articles if is_front_page_article(a)]
    daily_articles = filter_articles_by_date(front_page_articles, target_date)
    processed_articles = process_articles_batch(daily_articles)

    if processed_articles:
        save_articles_to_json(processed_articles, target_date, output_dir)

    return processed_articles


In [None]:

def map_news_to_trading_day(pub_datetime_str, close_hour=18, close_minute=0):
    dt = pd.to_datetime(pub_datetime_str)
    # Hafta sonuysa Pazartesi'ye kaydır
    if dt.weekday() >= 5:
        days_to_monday = 7 - dt.weekday()
        trading_day = dt + timedelta(days=days_to_monday)
        return trading_day.strftime("%Y-%m-%d")
    # Hafta içi, kapanış sonrası ise ertesi gün (ve yine hafta sonu kontrolü):
    if dt.hour > close_hour or (dt.hour == close_hour and dt.minute > close_minute):
        trading_day = dt + timedelta(days=1)
        while trading_day.weekday() >= 5:
            trading_day += timedelta(days=1)
        return trading_day.strftime("%Y-%m-%d")
    # Diğer durumlarda aynı gün:
    return dt.strftime("%Y-%m-%d")

In [None]:
def get_trading_date(pub_datetime_str):
    dt = pd.to_datetime(pub_datetime_str)
    # Eğer hafta sonuysa, bir sonraki Pazartesi'ye kaydır
    if dt.weekday() >= 5:
        days_to_monday = 7 - dt.weekday()
        dt = dt + timedelta(days=days_to_monday)
        return dt.strftime("%Y-%m-%d")
    return dt.strftime("%Y-%m-%d")

In [130]:
# SECTION FILTERING FOR FINANCIAL ANALYSIS
def is_financial_relevant_section(article: Dict, include_sections=None) -> bool:
    """
    Filter articles by sections relevant for financial analysis.
    
    Args:
        article: Article dictionary
        include_sections: List of sections to include. If None, uses default financial sections.
    
    Returns:
        bool: True if article is in a relevant section
    """
    if include_sections is None:
        # Default financial-relevant sections
        include_sections = [
            "Business Day", "Business", "Economy", "Economic", "Finance", "Financial",
            "Markets", "Market", "Technology", "Tech", "Politics", "Political",
            "U.S.", "World", "International", "Global", "Energy", "Oil",
            "Federal Reserve", "Treasury", "Trade", "Commerce"
        ]
    
    section_name = article.get("section_name", "").lower()
    news_desk = article.get("news_desk", "").lower()
    subsection_name = article.get("subsection_name", "").lower()
    
    # Check if any of the include_sections match
    for section in include_sections:
        section_lower = section.lower()
        if (section_lower in section_name or 
            section_lower in news_desk or 
            section_lower in subsection_name):
            return True
    
    return False

In [None]:
# 3. Günlük Pipeline — Her haberi işlem gününe göre ilgili dosyaya kaydeder
def collect_news_for_date(target_date, api_key, output_dir="news_data"):
    date_obj = datetime.strptime(target_date, "%Y-%m-%d")
    year, month = date_obj.year, date_obj.month

    all_articles = fetch_month_articles(api_key, year, month)
    front_page_articles = [a for a in all_articles if is_front_page_article(a)]
    daily_articles = filter_articles_by_date(front_page_articles, target_date)
    financial_articles = [a for a in daily_articles if is_financial_relevant_section(a)]
    processed_articles = process_articles_batch(financial_articles)

    # --- İşlem gününe göre grupla ve kaydet ---
    trading_day_groups = {}
    for article in processed_articles:
        pub_date = article.get("publication_date", "")
        trading_date = map_news_to_trading_day(pub_date)
        trading_day_groups.setdefault(trading_date, []).append(article)
    for trading_date, articles in trading_day_groups.items():
        save_articles_to_json(articles, trading_date, output_dir)
    return processed_articles

In [None]:
# 4. Ana Toplu Çekici
def process_all_front_page_news(api_key, output_dir="news_data", start_year=2004, end_year=2025):
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year, 12, 31)
    current_date = start_date

    while current_date <= end_date:
        target_date = current_date.strftime("%Y-%m-%d")
        try:
            collect_news_for_date(target_date, api_key, output_dir)
        except Exception as e:
            print(f"Error processing {target_date}: {e}")
        time.sleep(6)  # API rate limit için
        current_date += timedelta(days=1)

process_all_front_page_news(config.api_key, config.output_dir, start_year=2004, end_year=2025)

Saved standard file: historical_front_page_news\2004-01-01.json
Saved standard file: historical_front_page_news\2004-01-02.json
Saved standard file: historical_front_page_news\2004-01-02.json
Saved standard file: historical_front_page_news\2004-01-03.json
Saved standard file: historical_front_page_news\2004-01-03.json
Saved standard file: historical_front_page_news\2004-01-04.json
Saved standard file: historical_front_page_news\2004-01-04.json
Saved standard file: historical_front_page_news\2004-01-05.json
Saved standard file: historical_front_page_news\2004-01-05.json
Saved standard file: historical_front_page_news\2004-01-06.json
Saved standard file: historical_front_page_news\2004-01-06.json
Saved standard file: historical_front_page_news\2004-01-07.json
Saved standard file: historical_front_page_news\2004-01-07.json
Saved standard file: historical_front_page_news\2004-01-08.json
Saved standard file: historical_front_page_news\2004-01-08.json
Saved standard file: historical_front_pa

KeyboardInterrupt: 

## 🎯 HOW TO USE THE COLLECTION FUNCTION

The `collect_news_for_date()` function is now ready! Here's how to use it:

### Basic Usage:
```python
# Collect news for a specific date
articles = collect_news_for_date('2024-12-01', config.api_key)
```

### What it does:
1. **Fetches** all articles for the month (December 2024)
2. **Filters** for front page articles only (pages 1-10) 
3. **Filters** for the specific date (December 1st)
4. **Processes** articles into clean format
5. **Saves** to JSON file: `news_data/2024-12-01.json`

### Try different dates:
```python
# Recent dates
articles = collect_news_for_date('2024-11-15', config.api_key)
articles = collect_news_for_date('2024-10-31', config.api_key)

# Historical dates  
articles = collect_news_for_date('2020-03-15', config.api_key)  # COVID start
articles = collect_news_for_date('2008-09-15', config.api_key)  # Financial crisis
```

**✅ Run the test cell below to see it in action!**