# Automated Topic Summary Page Generation

## 1. Project Introduction
This project automates the extraction, summarization, and presentation of keyword-related news events. It consists of several modules:

1. **Data Cleaning (`cleaned_the_data`)** – cleans and standardizes raw news articles from `raw_news.json`, removing invalid entries, deduplicating using (title, link) hashes, stripping non-printable characters, normalizing text and dates, and producing `cleaned_news.json` as a reliable dataset.

2. **Timeline Extraction (`extract_timeline`)** – groups news articles by date and generates one-line summaries of major events per day, using GPT-5 with prompts focused on the target keyword, producing a structured timeline.

3. **Entity Extraction (`extract_entities`)** – extracts and normalizes people, organizations, and entities mentioned in the news dataset, generating structured JSON for downstream reporting.

4. **Automated Summarization (`summary`)** – generates a three-paragraph narrative summary using an authoritative timeline and optional headlines, enforcing strict formatting rules and avoiding any fabrication of facts.

5. **Report Generation (`generate_report`)** – compiles the summary, entities, timeline, and links into a visually appealing HTML report (`index.html`) with interactive cards and a responsive layout, accompanied by a self-contained CSS file for styling.

The pipeline produces clean, structured, and human-readable outputs, enabling efficient analysis, dissemination, and visualization of key events and entities related to the target keyword.

In [None]:
# import libraries
import argparse
import json
import logging
import os
import random
import re
import socket
import threading
import time
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from difflib import SequenceMatcher
from pathlib import Path
from typing import Any, Dict, List, Set, Tuple
from urllib.parse import urlparse

import pandas as pd
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
from requests.adapters import HTTPAdapter
from tqdm import tqdm
from urllib3.util.retry import Retry

## 2. Crawl the news
**crawl_the_news** is the first stage of the pipeline: it collects news articles automatically and exports them as structured JSON. It queries **five news APIs (NewsAPI, GNews, TheNewsAPI, CurrentsAPI, Mediastack)** to maximize coverage and reduce information gaps. The workflow is: split the time range at a random midpoint and query NewsAPI once per half to improve retrieval; call each API's `fetch_*()` function to collect titles, timestamps, and URLs; extract the full article text with BeautifulSoup, trying an extensive list of CSS selectors first and falling back to concatenated paragraphs; and normalize every article into a unified `{title, date, link, text}` structure. The output is saved as `raw_news.json`, which serves as the input for the cleaning, deduplication, entity extraction, and summarization modules.
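
The normalization step can be sketched as below. This is a minimal illustration, not the notebook's own code: `normalize_article` is a hypothetical helper, and the timestamp key names in `DATE_KEYS` are assumptions about how the providers label their publication time.

```python
from typing import Any, Dict

# Assumed timestamp key names; check each provider's response format.
DATE_KEYS = ("publishedAt", "published_at", "published")

def normalize_article(article: Dict[str, Any], text: str) -> Dict[str, str]:
    """Map one raw API record onto the unified {title, date, link, text} shape."""
    # Take the first timestamp key that is present and non-empty
    date = next((article[k] for k in DATE_KEYS if article.get(k)), "No date")
    return {
        "title": article.get("title") or "No title",
        "date": date,
        "link": article.get("url") or "",
        "text": text,
    }

sample = {
    "title": "Example headline",
    "published_at": "2025-10-06T09:30:00Z",
    "url": "https://example.com/a",
}
print(normalize_article(sample, "body text"))
```

Looking up several candidate keys keeps records from different providers from all ending up with the `"No date"` placeholder.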

In [None]:
# Key
NewsAPI_Key = "8406ef98a8b24bec854801aa9f2c6a35"
GNews_Key = "9a6066514e3ca31d8ec6c184b2c33594"
TheNewsAPI_Key = "wEj2kyyJhPKLICmZavDq2MeJgbOr1KcyLbU0X3Au"
CurrentsAPI_Key = "wMSLtPfn74YOMCOyIGv49vXAfIrD2bcXGVgEj_zN1AgA8b3G"
Mediastack_Key = "465890a7953f6a540676c7c0fb86508a"

# URL
NewsAPI_URL = "https://newsapi.org/v2/everything"
GNews_URL = "https://gnews.io/api/v4/search"
TheNewsAPI_URL = "https://api.thenewsapi.com/v1/news/all"
Mediastack_URL = "http://api.mediastack.com/v1/news"

# json name
raw_json = "raw_news.json"
cleaned_json = "cleaned_news.json"

In [None]:
def extract_article_content(url):
    """
    Extract main content from news webpage URL
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Remove unwanted tags
        for tag in ['script', 'style', 'nav', 'header', 'footer', 'aside']:
            for element in soup.find_all(tag):
                element.decompose()
        
        # Content selectors for news websites
        content_selectors = [
            # Main content area
            'article',
            'main',
            '.main-content',
            '.content-main',
            '#main-content',
            '#content-main',
            
            # News specific selector
            '.article',
            '.story',
            '.news-article',
            '.post',
            '.entry',
            
            # Main content
            '.article-body',
            '.story-body',
            '.post-body',
            '.entry-content',
            '.article-content',
            '.story-content',
            '.post-content',
            '.news-content',
            '.content-body',
            '.body-content',
            
            # text content
            '.text-content',
            '.article-text',
            '.story-text',
            '.post-text',
            
            # General Content
            '[class*="content"]',
            '[class*="article"]',
            '[class*="story"]',
            '[class*="post"]',
            '[class*="entry"]',
            '[class*="body"]',
            '[class*="text"]',
            
            # Specific news websites
            '.zn-body__paragraph',  # CNN
            '.caas-body',           # Yahoo News
            '.Article__Content',    # Bloomberg
            '.article-section',     # Reuters
            '.article-page',        # BBC
            '.story-wrapper',       # NBC
            '.article-wrapper',
            
            # Container selector
            '.container',
            '.wrapper',
            '.main',
            '#main',
            '#content',
            '.page-content'
        ]
        
        # Try selectors first
        for selector in content_selectors:
            elements = soup.select(selector)
            for element in elements:
                text = element.get_text(strip=True)
                text = re.sub(r'\s+', ' ', text)
                if len(text) > 200:
                    return text
        
        # Fallback: combine paragraphs
        paragraphs = soup.find_all('p')
        if paragraphs:
            content = ' '.join(p.get_text(strip=True) for p in paragraphs if len(p.get_text(strip=True)) > 50)
            content = re.sub(r'\s+', ' ', content)
            if len(content) > 100:
                return content
        
        return "No valid content extracted"
        
    except Exception as e:
        return f"Error: {str(e)}"

In [None]:
def save_to_json(data, filename='raw_news.json'):
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        print(f"The data has been saved to {filename}")
        return True
    except Exception as e:
        print(f"Error saving file: {e}")
        return False

In [None]:
def fetch_news_from_newsapi(keyword, start_time, end_time):
    params = {
        'q': keyword,
        'from': start_time,
        'to': end_time,
        'sortBy': 'publishedAt',
        'pageSize': 100,
        'language': 'en',
        'apiKey': NewsAPI_Key
    }
    
    try:
        response = requests.get(NewsAPI_URL, params=params)
        response.raise_for_status()
        data = response.json()
        articles = data.get('articles', [])
        print(f"Fetched {len(articles)} articles from NewsAPI")
        return articles
    except Exception as e:
        print(f"NewsAPI request failed: {e}")
        return []

def fetch_news_from_gnews(keyword, start_time, end_time):  
    params = {
        'q': keyword,
        'from': start_time,
        'to': end_time,
        'max': 100,
        'lang': 'en',
        'token': GNews_Key
    }
    
    try:
        response = requests.get(GNews_URL, params=params)
        response.raise_for_status()
        data = response.json()
        articles = data.get('articles', [])
        print(f"Fetched {len(articles)} articles from GNews")
        return articles
    except Exception as e:
        print(f"GNews API request failed: {e}")
        return []

def fetch_news_from_thenewsapi(keyword, start_time, end_time):
    params = {
        'api_token': TheNewsAPI_Key,
        'search': keyword,
        'published_after': start_time,
        'published_before': end_time,  # bound the range from above as well
        'language': 'en',
        'limit': 100
    }
    
    try:
        response = requests.get(TheNewsAPI_URL, params=params)
        response.raise_for_status()
        data = response.json()
        articles = data.get('data', [])
        print(f"Fetched {len(articles)} articles from The News API")
        return articles
    except Exception as e:
        print(f"The News API request failed: {e}")
        return []

def fetch_nobel_news_from_currentsapi(keyword, start_time, end_time):
    # Convert YYYY-MM-DD strings to timezone-aware ISO 8601 timestamps (UTC)
    start_dt = datetime.strptime(start_time, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end_dt = datetime.strptime(end_time, "%Y-%m-%d").replace(tzinfo=timezone.utc)

    # Pass the query through `params` so requests URL-encodes every value
    params = {
        'keywords': keyword,
        'language': 'en',
        'apiKey': CurrentsAPI_Key,
        'start_date': start_dt.isoformat(),
        'end_date': end_dt.isoformat(),
    }

    try:
        response = requests.get('https://api.currentsapi.services/v1/search', params=params)
        response.raise_for_status()
        
        data = response.json()
        
        if data.get('status') == 'ok':
            articles = data.get('news', [])
            print(f"Fetched {len(articles)} articles from CurrentsAPI")
            return articles
        else:
            print(f"CurrentsAPI returned error: {data.get('message', 'Unknown error')}")
            return []
        
    except requests.exceptions.RequestException as e:
        print(f"CurrentsAPI request failed: {e}")
        return []
    except json.JSONDecodeError:
        print("Failed to parse CurrentsAPI response")
        return []
    except Exception as e:
        print(f"Unexpected error: {e}")
        return []

def fetch_news_from_mediastack(keyword, start_time, end_time):
    params = {
        'access_key': Mediastack_Key,
        'keywords': keyword,
        'languages': 'en',
        'limit': 100,
        'sort': 'published_desc',
        'date': f'{start_time},{end_time}'
    }
    
    try:
        response = requests.get(Mediastack_URL, params=params)
        data = response.json()
        
        if 'data' in data:
            articles = data.get('data', [])
            print(f"Fetched {len(articles)} articles from Mediastack")
            return articles
        else:
            print(f"Error: {data.get('error', 'Unknown error')}")
            return []
            
    except Exception as e:
        print(f"Mediastack API error: {e}")
        return []

In [None]:
def process_news_data(keyword, start_time, end_time):
    print("Start obtaining news data...")

    raw_data = []
    # Pick a random midpoint so NewsAPI can be queried once per half of the range
    start_dt = datetime.strptime(start_time, "%Y-%m-%d")
    end_dt = datetime.strptime(end_time, "%Y-%m-%d")
    delta = end_dt - start_dt
    random_days = random.randint(0, delta.days)
    middle_time = (start_dt + timedelta(days=random_days)).strftime("%Y-%m-%d")
    
    # Get data from all APIs
    # Search twice
    newsapi_articles_partone = fetch_news_from_newsapi(keyword, start_time, middle_time)
    newsapi_articles_parttwo = fetch_news_from_newsapi(keyword, middle_time, end_time)
    gnews_articles = fetch_news_from_gnews(keyword, start_time, end_time)
    thenewsapi_articles = fetch_news_from_thenewsapi(keyword, start_time, end_time)
    currents_articles = fetch_nobel_news_from_currentsapi(keyword, start_time, end_time)
    mediastack_articles = fetch_news_from_mediastack(keyword, start_time, end_time)
    
    # Combine all articles
    all_articles = []
    all_articles.extend(newsapi_articles_partone)
    all_articles.extend(newsapi_articles_parttwo)
    all_articles.extend(gnews_articles)
    all_articles.extend(thenewsapi_articles)
    all_articles.extend(currents_articles)
    all_articles.extend(mediastack_articles)
    
    print(f"Total articles: {len(all_articles)}")
    
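    for i, article in enumerate(all_articles, 1):
        # Use .get() throughout: not every API returns every field
        title = article.get('title') or 'No title'
        url = article.get('url') or ''
        print(f"Processing {i}/{len(all_articles)}: {title[:50]}...")

        # Extract article content
        text_content = extract_article_content(url) if url else "No valid content extracted"

        # Build data structure; the timestamp key differs by API
        # (assumed here: 'publishedAt', 'published_at', or 'published')
        news_item = {
            "title": title,
            "date": (article.get('publishedAt') or article.get('published_at')
                     or article.get('published') or 'No date'),
            "link": url,
            "text": text_content
        }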
        
        raw_data.append(news_item)
        
        # Add delay to avoid rate limiting
        time.sleep(1)
    
    return raw_data

In [None]:
def crawl_the_news(keyword, start_time, end_time): 
    # Processing news data
    raw_news_data = process_news_data(keyword, start_time, end_time)
    
    if raw_news_data:
        # Save to JSON file
        success = save_to_json(raw_news_data, raw_json)
        
        if success:
            print(f"Successfully processed {len(raw_news_data)} articles")
        else:
            print("Failed to save file")
    else:
        print("No data obtained")

## 3. Clean the data

**cleaned_the_data** cleans and standardizes the raw dataset. It takes raw_news.json as input and produces a filtered, deduplicated, and normalized dataset cleaned_news.json. The module removes empty or invalid articles, eliminates duplicates using a (title, link) hash set, strips non-printable characters, converts article text to lowercase, and normalizes all dates to YYYY-MM-DD. Each cleaned record is stored in a unified structure {title, date, link, text}. The resulting cleaned_news.json serves as the clean and reliable input for downstream summarization, entity extraction, and timeline generation.
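
The two core cleaning rules, date normalization and `(title, link)` deduplication, can be sketched in isolation. This is a simplified illustration: `normalize_date` and `dedupe` are hypothetical helper names, and the real module applies further filters.

```python
import re
from datetime import datetime

def normalize_date(date_str: str) -> str:
    """Return YYYY-MM-DD, or "" when no date can be recovered."""
    try:
        # Older Python versions reject a trailing "Z" in fromisoformat()
        dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        return dt.strftime("%Y-%m-%d")
    except ValueError:
        # Fall back to the first YYYY-MM-DD or YYYY/MM/DD pattern in the string
        match = re.search(r"(\d{4})[-/](\d{2})[-/](\d{2})", date_str)
        return "-".join(match.groups()) if match else ""

def dedupe(records):
    """Keep the first record seen for each (title, link) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["title"], rec["link"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

print(normalize_date("2025-10-06T09:30:00Z"))   # 2025-10-06
print(normalize_date("Published 2025/10/07"))   # 2025-10-07
```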

In [None]:
# cleaned data structure:

# cleaned_data_list = []
# cleaned_data = {
#     "title" : title
#     "date" : date
#     "link" : link
#     "text" : text
# }

def cleaned_the_data():
    # Load the original file
    with open(raw_json, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    cleaned = []
    seen = set()
    
    for item in data:
        title = item.get("title", "").strip()
        link = item.get("link", "").strip()
        text = item.get("text", "").strip()
        date_str = item.get("date", "").strip()
    
        # Skip empty records or invalid text
        if not title or not link or not text:
            continue
        # Also drop records whose text is an extraction error message
        if text.lower() == "no valid content extracted" or text.startswith("Error:"):
            continue
    
        # Skip duplicates
        if (title, link) in seen:
            continue
        seen.add((title, link))
    
        # Remove gibberish or control characters (keep printable English/Chinese chars)
        def clean_str(s):
            return re.sub(r"[^\x09\x0A\x0D\x20-\x7E\u4E00-\u9FFF]", " ", s)
    
        title = clean_str(title)
        text = clean_str(text).lower()  # convert all text to lowercase
    
        # Normalize date format to YYYY-MM-DD
        if date_str:
            try:
                dt = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                date_str = dt.strftime("%Y-%m-%d")
            except Exception:
                match = re.search(r"(\d{4})[-/](\d{2})[-/](\d{2})", date_str)
                if match:
                    date_str = "-".join(match.groups())
                else:
                    date_str = ""
    
        cleaned.append({
            "title": title,
            "date": date_str,
            "link": link,
            "text": text.strip()
        })
    
    # Save cleaned data
    with open(cleaned_json, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    
    print(f"Cleaning completed. {len(cleaned)} valid news articles saved to cleaned_news.json.")

## 4. Extract the information

### extract_timeline
`extract_timeline` generates a structured daily event timeline using GPT-5.
It takes the cleaned dataset as input, groups all articles by date, and composes a compact text block for each day.
A date-specific prompt is then sent to GPT-5, instructing the model to ignore all items unrelated to the specified keyword and produce exactly one sentence describing the key event of that day, including the main people, organizations, and time.
Each returned summary is stored in the unified structure {date, event}.
The resulting timeline list provides a concise, date-ordered narrative foundation for the final summary webpage.
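
The grouping step that precedes each model call can be sketched without any API access. `build_daily_blocks` is a hypothetical name; it only prepares the per-day text blocks and does not call GPT-5.

```python
from collections import defaultdict

def build_daily_blocks(news):
    """Group articles by date and return [(date, text_block), ...], oldest first."""
    by_date = defaultdict(list)
    for item in news:
        if item.get("date"):  # undated articles cannot be placed on the timeline
            by_date[item["date"]].append(item)
    blocks = []
    for date in sorted(by_date):  # YYYY-MM-DD strings sort chronologically
        joined = "\n\n".join(
            f"- Title: {it.get('title', '')}\n  Text: {it.get('text', '')}"
            for it in by_date[date]
        )
        blocks.append((date, joined))
    return blocks

news = [
    {"date": "2025-10-07", "title": "B", "text": "second day"},
    {"date": "2025-10-06", "title": "A", "text": "first day"},
    {"date": "", "title": "C", "text": "undated, skipped"},
]
for date, block in build_daily_blocks(news):
    print(date, "->", block.splitlines()[0])
```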

### extract_entities
extract_entities extracts and normalizes all keyword-related entities from the cleaned dataset using GPT-5.
For each article, a prompt is sent instructing the model to identify and canonicalize names of people, organizations, and prizes, and return them in strict JSON format.
The module parses each JSON response and merges all extracted entries into a unified list following the structure {people, organizations, prize}.
The resulting entity set forms the structured knowledge base used for the final summary page’s key-entity section.
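
Because the function returns one record per article, a report would likely need the records merged and deduplicated. The sketch below shows one way to do that; `merge_entities` is a hypothetical helper that is not part of the notebook, and the names in the example are placeholders.

```python
def merge_entities(entities_list):
    """Collapse per-article entity records into one deduplicated dict."""
    merged = {"people": [], "organizations": [], "prize": []}
    for record in entities_list:
        for field, names in merged.items():
            value = record.get(field) or []
            # Tolerate a bare string where a list was expected
            for name in ([value] if isinstance(value, str) else value):
                if name not in names:  # preserve first-seen order
                    names.append(name)
    return merged

records = [
    {"people": ["Person A"], "organizations": ["Org X"], "prize": "Prize 1"},
    {"people": ["Person A", "Person B"], "organizations": [], "prize": ["Prize 1"]},
]
print(merge_entities(records))
```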

In [None]:
# timeline structure

# timeline_list = []
# timeline = {
#     "date" : date
#     "event" : event
# }

# entities.json

# entities_list = []
# entity = {
#     "people" : people
#     "prize" : prize
#     "organizations" : organizations
# }

OPENAI_API_KEY="test"

client = OpenAI(api_key=OPENAI_API_KEY)

# ---------------------------------------------------------------------
# 1. Extract timeline
# ---------------------------------------------------------------------
def extract_timeline(news,keyword):
    """
    Group articles by date and ask GPT-5 to summarize each day's major event.
    Returns:
        timeline_list = [
            {"date": "YYYY-MM-DD", "event": "..."},
            ...
        ]
    """
    # group by date
    by_date = defaultdict(list)
    for item in news:
        d = item.get("date")
        if d:
            by_date[d].append(item)

    timeline_list = []
    # process each date, oldest to newest, and produce a one-line summary
    for d, items in sorted(by_date.items(), key=lambda x: x[0]):
        # prepare short text for the model
        joined = "\n\n".join([
            f"- Title: {it.get('title','')}\n  Text: {it.get('text','')}"
            for it in items
        ])

        system = (
            f"""
            You are an expert summarizer. Based only on the provided file content, without searching the web,
            remove all news items that are not directly related to the {keyword}.
            For each remaining {keyword}-related event, compress it into exactly one sentence.
            Each sentence must clearly state the key person(s), organization(s), and time.
            Return the final set of one-sentence events only.
            """
            # """
            # You are given several text messages that were recorded on the same day.
            # Your task is to identify and extract the distinct events mentioned in these messages.
            # """
        )
        user = f"Date: {d}\nNews snippets:\n{joined}\n\nReturn only the final one-line event."

        response = client.responses.create(
            model="gpt-5",
            input=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ]
        )
        content=response.output_text
        timeline_list.append({"date": d, "event": content})

    return timeline_list


# ---------------------------------------------------------------------
# 2. Extract entities
# ---------------------------------------------------------------------

def extract_entities(news,keyword):
    """
    Ask GPT-5 to extract and normalize people, organizations, and prizes
    from each article in the dataset.
    Returns:
        entities_list = [
            {"people": [...], "organizations": [...], "prize": [...]},
            ...
        ]
    """

    entities_list = []
    for item in news:
        # one article per request keeps the prompt short
        joined = f"- Title: {item.get('title','')}\n  Text: {item.get('text','')}"
        user = (
            f"""
            Extract and normalize {keyword}-related named entities from the following article.
            Return a JSON array where each element has fields: 'people', 'organizations', 'prize'.
            Each entry should contain unique canonical names.
            Ensure the output is strictly valid JSON.

            Article:
            {joined}
            """
        )
        response = client.responses.create(
            model="gpt-5",
            input=[{"role": "user", "content": user}],
            # text_format=Entities
        )
        try:
            result = json.loads(response.output_text)
        except json.JSONDecodeError:
            # skip articles whose response is not valid JSON
            continue
        # accept either a single object or an array of objects
        if isinstance(result, dict):
            result = [result]
        if isinstance(result, list):
            entities_list.extend(r for r in result if isinstance(r, dict))

    return entities_list

## 5. Summarize

**summary()** generates a three-paragraph English narrative summary of the 2025 Nobel Prizes. It takes cleaned_news.json as input and optionally uses timeline.json as the authoritative timeline, producing final_summary.txt in the specified output directory. The function filters news for keyword-related items, constructs system and user prompts, and calls the API https://open.bigmodel.cn/api/paas/v4/chat/completions via call_glm(). The system prompt (OVERALL_PROMPT) enforces strict format, style, and no-fabrication rules, while the user prompt provides the timeline and news headlines, instructing the model to output exactly three paragraphs. The response is processed to ensure three paragraphs; extra paragraphs are truncated and fewer paragraphs are refactored using refactor_to_three_paragraphs(). AdaptiveLimiter manages request rate and handles 429, timeout, or other errors. The final output aligns with the timeline and ignores non-Nobel or non-2025 events, providing high-quality text for downstream analysis.
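
The paragraph check applied to the model response can be illustrated on its own. In this sketch `first_three_paragraphs` is a hypothetical name; it covers only the truncation case and signals with `None` when the sentence-based fallback would be needed.

```python
import re

def first_three_paragraphs(raw: str):
    """Return the first three paragraphs, whitespace-normalized, or None if fewer exist."""
    # Paragraphs are separated by one or more blank lines
    paras = [p for p in re.split(r"\n\s*\n", raw.strip()) if p.strip()]
    if len(paras) < 3:
        return None  # the caller would fall back to sentence-based splitting
    return "\n\n".join(re.sub(r"\s+", " ", p).strip() for p in paras[:3])

raw = "One.\n\nTwo  spans\nlines.\n\nThree.\n\nFour."
print(first_three_paragraphs(raw))
```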

In [None]:
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"

logging.basicConfig(level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s", datefmt="%H:%M:%S")
log = logging.getLogger("topic2")

def ensure_dir(p): os.makedirs(p, exist_ok=True); return p
def load_json(p):
    with open(p, "r", encoding="utf-8") as f: return json.load(f)
def domain_of(url: str) -> str:
    try: return urlparse(url).netloc or ""
    except Exception: return ""
def is_refusal(text: str) -> bool:
    # An empty or whitespace-only response counts as a failed generation
    return not (text or "").strip()


_SENT_SPLIT = re.compile(r'(?<=[。！？!?\.])\s+(?=[A-Z“"(\[]|[A-Z][a-z])')
def split_sentences(text: str):
    t = re.sub(r'\s+', ' ', (text or "").strip())
    parts = _SENT_SPLIT.split(t)
    if len(parts) <= 1:
        parts = re.split(r'(?<=[\.!?])\s+', t)
    return [s.strip() for s in parts if s.strip()]

def refactor_to_three_paragraphs(text: str):
    sents = split_sentences(text)
    if not sents: return (text or "").strip()
    n = len(sents)
    if n <= 3:
        p1 = " ".join(sents[:1]); p2 = " ".join(sents[1:2]); p3 = " ".join(sents[2:])
        return "\n\n".join([p for p in (p1,p2,p3) if p]).strip()
    p1_len = min(5, max(3, n//6 or 3))
    rem = n - p1_len
    p2_len = min(10, max(6, rem//2 or 6))
    p3_len = n - p1_len - p2_len
    if p3_len < 3 and n >= 12:
        move = min(3 - p3_len, p2_len - 6)
        if move > 0:
            p2_len -= move
            p3_len += move
    p1 = " ".join(sents[:p1_len]).strip()
    p2 = " ".join(sents[p1_len:p1_len+p2_len]).strip()
    p3 = " ".join(sents[p1_len+p2_len:]).strip()
    return "\n\n".join([p for p in (p1,p2,p3) if p]).strip()

# Match the word "Nobel" in any letter case; match "2025" anywhere in the string
_NOBEL_PAT = re.compile(r"\bNobel\b", re.IGNORECASE)
_YEAR_2025_PAT = re.compile(r"2025")

def is_nobel_related(title: str, text: str = "") -> bool:
    t = (title or "").strip()
    if not t: return False
    return bool(_NOBEL_PAT.search(t) or _NOBEL_PAT.search(text or ""))

def is_year_2025(title: str, date_str: str, text: str = "") -> bool:
    if _YEAR_2025_PAT.search(title or "") or _YEAR_2025_PAT.search(text or ""):
        return True
    if (date_str or "").startswith("2025-"):
        return True
    return False

class AdaptiveLimiter:
    def __init__(self, qps: float = 0.5, min_qps: float = 0.15, max_qps: float = 1.2):
        self.lock = threading.Lock(); self.qps=qps; self.min_qps=min_qps; self.max_qps=max_qps
        self.last = 0.0; self.cool_until = 0.0
    def wait(self):
        with self.lock:
            now = time.monotonic()
            if now < self.cool_until:
                time.sleep(self.cool_until - now); now = time.monotonic()
            interval = 1.0 / max(self.qps, self.min_qps)
            delta = interval - (now - self.last)
            if delta > 0: time.sleep(delta); now = time.monotonic()
            self.last = now
    def punish_429(self):
        with self.lock:
            self.qps = max(self.min_qps, self.qps * 0.6)
            cool = 6.0 + random.random()*6.0
            self.cool_until = time.monotonic() + cool
            log.warning(f"Hit 429: slowing to {self.qps:.2f} QPS and cooling down for {cool:.1f}s")
    def punish_timeout(self):
        with self.lock:
            self.qps = max(self.min_qps, self.qps * 0.8)
            log.warning(f"Request timed out: slowing to {self.qps:.2f} QPS")

def make_session():
    s = requests.Session()
    retry = Retry(
        total=3, connect=3, read=3,
        backoff_factor=1.2,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"]
    )
    adapter = HTTPAdapter(max_retries=retry, pool_maxsize=2)
    s.mount("https://", adapter); s.mount("http://", adapter)
    s.headers.update({"Accept":"application/json", "Connection":"close"})
    return s

def call_glm(session: requests.Session, limiter: AdaptiveLimiter, api_key: str, messages,
             model="glm-4.5-flash", temperature=0.26, max_tokens=1400,
             timeout=120, max_retries=3):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    payload = {"model": model, "messages": messages, "temperature": temperature,
               "max_tokens": max_tokens, "stream": False}
    last_err = None
    for attempt in range(max_retries + 1):
        limiter.wait()
        try:
            resp = session.post(API_URL, headers=headers, json=payload, timeout=timeout)
            if resp.status_code == 200:
                j = resp.json()
                txt = j.get("choices", [{}])[0].get("message", {}).get("content", "")
                return txt
            if resp.status_code == 429:
                last_err = "HTTP 429"
                limiter.punish_429()
                time.sleep(1.2 + random.random()); continue
            if resp.status_code in (408, 500, 502, 503, 504):
                last_err = f"HTTP {resp.status_code}"
                wait = (1.8 + random.random()) * (2 ** attempt)
                log.warning(f"HTTP {resp.status_code}: {resp.text[:160]}... retrying in {wait:.1f}s")
                time.sleep(wait); continue
            raise RuntimeError(f"HTTP {resp.status_code}: {resp.text[:500]}")
        except (requests.Timeout, socket.timeout) as e:
            last_err = e; limiter.punish_timeout()
            if attempt < max_retries:
                wait = (1.2 + random.random()) * (2 ** attempt)
                log.warning(f"Request timed out; retrying in {wait:.1f}s")
                time.sleep(wait); continue
            break
        except requests.RequestException as e:
            last_err = e; break
    raise RuntimeError(f"Model unavailable: {last_err or 'unknown error'}")

# ---------------- main ----------------
def summary(keyword):
    #prompts
    TIMELINE_PREAMBLE = (
        "Here is an authoritative timeline (date + event) that MUST be treated as ground truth. "
        "When headlines conflict, resolve in favor of the timeline. "
        "Do NOT invent prizewinners or entities not supported by the timeline/headlines."
    )
    
    OVERALL_PROMPT = f"""
        You will receive:
        1) An authoritative timeline of the {keyword} (ground truth).
        2) Optionally, a list of headlines (title, publisher domain, date).
        
        Your task: write a **single English narrative summary** ONLY about the **{keyword}**.
        
        STRICT FORMAT & STYLE:
        • **Output EXACTLY THREE PARAGRAPHS**, with a blank line between paragraphs.
        • Paragraph 1 (3–5 sentences): concise highlights in flowing prose — e.g.,
          “the {keyword} honored key contributors, highlighting their impact in relevant fields.
           Use such phrasing **only if these facts are supported**; otherwise use generic wording without adding specifics.”
        • Paragraph 2 (~200–300 words): an integrative overview linking breakthroughs and societal meaning.
        • Paragraph 3 (~180–260 words): synthesize 3–5 cross-cutting themes with transitions (meanwhile, in turn, as a result, by contrast…).
          Do **not** enumerate by date or outlet.
        
        HARD CONSTRAINTS:
        • **NO FABRICATION**: do not invent winners, categories, dates, affiliations, numbers, or methods.
        • **TIMELINE-ALIGNED**: if any conflict arises, prefer the timeline; otherwise generalize.
        • **FOCUS**: Ignore any non-{keyword} or non-2025 items entirely.
        • Output plain text only (no JSON, no headers, no lists).
        """

    ap = argparse.ArgumentParser()
    ap.add_argument("--input", default="cleaned_news.json")
    ap.add_argument("--timeline", default="timeline.json")
    ap.add_argument("--output_dir", default="outputs")
    ap.add_argument("--model", default="glm-4.5-flash")
    ap.add_argument("--api-key", default="d0b8bc52cf6b4c368982dfdd32384757.UcWBjZr72H7AWgyN")
    ap.add_argument("--qps", type=float, default=0.5)
    ap.add_argument("--use-headlines", action="store_true")
    # parse_known_args() ignores the extra arguments a Jupyter kernel passes in
    args, _ = ap.parse_known_args()

    ensure_dir(args.output_dir)

    # load timeline
    if not args.timeline or not os.path.exists(args.timeline):
        log.warning("Timeline file not found; falling back to headlines")
        timeline = []
    else:
        timeline = load_json(args.timeline)
        if not isinstance(timeline, list):
            log.warning("Timeline file is not a JSON list; ignoring it"); timeline = []
        else:
            log.info(f"Loaded timeline: {args.timeline} ({len(timeline)} entries)")
    # load news
    if not os.path.exists(args.input):
        log.error(f"Input file not found: {args.input}"); return
    data = load_json(args.input)
    log.info(f"Loaded data: {args.input}, total: {len(data)}")

    nobel_items = []
    for it in data:
        title = (it.get("title") or "").strip()
        text  = (it.get("text")  or "")
        date  = (it.get("date")  or "unknown").strip() or "unknown"
        if not title: continue
        if is_nobel_related(title, text) and is_year_2025(title, date, text):
            nobel_items.append({"date": date, "title": title, "source": domain_of(it.get("link","") or "")})

    if not timeline and not nobel_items:
        out = os.path.join(args.output_dir, "final_summary.txt")
        with open(out, "w", encoding="utf-8") as f:
            f.write(f"No {keyword}-related timeline or headlines were found.")
        log.info(f"Written: {out}"); return

    system_prompt = OVERALL_PROMPT
    blocks = []

    if timeline:
        blocks.append("TIMELINE (authoritative):\n" + json.dumps(timeline, ensure_ascii=False, indent=2))

    if args.use_headlines and nobel_items:
        items_blob = json.dumps({"items": sorted(nobel_items, key=lambda x: x['date'])}, ensure_ascii=False)
        blocks.append("HEADLINES (secondary evidence):\n" + items_blob)

    user_msg = (
        "Use the timeline as ground truth. If any conflict arises, prefer the timeline.\n\n"
        + ("\n\n".join(blocks) if blocks else "No timeline provided; rely on headlines without fabrication.")
        + "\n\nRemember: Output EXACTLY THREE PARAGRAPHS separated by a blank line. "
          "Do not list dates or outlets; write flowing prose."
    )

    session = make_session()
    limiter = AdaptiveLimiter(qps=args.qps)

    try:
        raw = call_glm(session, limiter, args.api_key or os.getenv("ZHIPU_API_KEY",""),
                       [{"role":"system","content":system_prompt},
                        {"role":"user","content":user_msg}],
                       model=args.model, temperature=0.26, max_tokens=2400, timeout=120)
    except Exception as e:
        log.warning(f"Generation failed: {e}")
        raw = ""

    if is_refusal(raw):
        final = "Could not generate a summary."
    else:
        paras = [p for p in re.split(r'\n\s*\n', raw.strip()) if p.strip()]
        final = ("\n\n".join(re.sub(r'\s+',' ', p).strip() for p in paras[:3])
                 if len(paras) >= 3 else refactor_to_three_paragraphs(raw))

    out_path = os.path.join(args.output_dir, "final_summary.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(final)
    log.info(f"Written: {out_path}")
    log.info("Finished")

## 6. Generate HTML Page
**generate_report**: The `generate_report` module produces a clean, interactive HTML report from the processed news data. It loads the final summary, entities, timeline, and cleaned articles, then renders them into structured sections: a hero header, research summary, key entities, event timeline, and news sources. The page uses Bootstrap styling, a responsive layout, and scroll-triggered animations.

Helper functions load each input file and fall back to `None` or an empty string on failure, and the cards and grids are generated dynamically from the data. Smooth scrolling and navbar section highlighting make the page easier to navigate. The resulting `index.html` gives a readable overview of the news collected for the keyword.
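The card markup in this module is built by interpolating article fields directly into f-strings, so a title containing `<` or `&` would be written out as markup. The block below is a minimal sketch of escaping those values with the standard library's `html.escape` first. `render_source_card` is a hypothetical helper introduced for illustration and is not part of the module, and the sample title and URL are placeholders.

```python
from html import escape


def render_source_card(title: str, url: str) -> str:
    """Return one source card with the title and URL HTML-escaped.

    Hypothetical helper: the class names mirror the source cards used
    in the report, but this function does not exist in the module.
    """
    safe_title = escape(title)          # <, >, & and quotes become entities
    safe_url = escape(url, quote=True)  # safe inside a double-quoted attribute
    return (
        '<div class="col-md-6">'
        '<div class="card source-card custom-card"><div class="card-body">'
        f'<h5 class="card-title">{safe_title}</h5>'
        f'<a href="{safe_url}" target="_blank" '
        'class="btn btn-outline-primary btn-sm">Read Article →</a>'
        '</div></div></div>'
    )


# Placeholder input, not real pipeline data
card = render_source_card('AT&T <b>wins</b>', 'https://example.com/?a=1&b=2')
print(card)
```

The same call could wrap entity names and timeline descriptions before they are placed into their cards.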

This module bridges data engineering and front-end presentation, turning the pipeline's JSON and text outputs into a single report page. A missing or unreadable input does not abort generation; the affected section shows a placeholder message instead.
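One way to make the fault tolerance more explicit is to have the loader return a caller-supplied default and check the top-level type of the decoded JSON. The following is a sketch under those assumptions: `load_json_or_default` and its `expected_type` parameter are hypothetical and do not appear in the module, and the demo record is placeholder data written to a temporary directory.

```python
import json
import os
import tempfile


def load_json_or_default(path, default, expected_type=None):
    """Load JSON from path; return default if the file is missing,
    unreadable, not valid JSON, or not of the expected top-level type.

    Hypothetical helper, shown only to illustrate the idea.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError) as e:  # JSONDecodeError is a ValueError
        print(f"Could not load {path}: {e}")
        return default
    if expected_type is not None and not isinstance(data, expected_type):
        print(f"Unexpected JSON type in {path}: {type(data).__name__}")
        return default
    return data


# Demo with placeholder data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "timeline.json")
    with open(good, "w", encoding="utf-8") as f:
        json.dump([{"date": "2025-10-06", "event": "placeholder"}], f)

    loaded = load_json_or_default(good, [], expected_type=list)
    missing = load_json_or_default(os.path.join(tmp, "absent.json"), [], expected_type=list)
    wrong_type = load_json_or_default(good, {}, expected_type=dict)

print(loaded, missing, wrong_type)
```

Returning a typed default means the rendering code can iterate over the result without a separate `None` check.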

In [None]:
import json
from datetime import datetime

def load_json_file(filepath):
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return json.load(f)
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None

def load_text_file(filepath):
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return ""

def generate_html_page(keywords, summary_text, entities_data, timeline_data, news_data, output_file):
    entities_html = ""
    if entities_data:
        if isinstance(entities_data, list):
            for award in entities_data:
                # Process award name (could be string or list)
                prize = award.get('prize', '')
                if isinstance(prize, list):
                    prize = ', '.join(map(str, prize))
                # Fall back for both an empty list and an empty string
                prize = prize or 'Unknown Award'
                
                entities_html += f'''
                <div class="col-md-6 col-lg-4">
                    <div class="entity-card">
                        <h5>{prize}</h5>
                        <div class="award-details">'''
                
                # Process people
                people = award.get('people', [])
                if people:
                    entities_html += '<div class="entity-section"><strong>Recipients:</strong><ul class="entity-list">'
                    for person in people:
                        entities_html += f'<li>{person}</li>'
                    entities_html += '</ul></div>'
                
                # Process organizations
                organizations = award.get('organizations', [])
                if organizations:
                    entities_html += '<div class="entity-section"><strong>Awarding Institutions:</strong><ul class="entity-list">'
                    for org in organizations:
                        entities_html += f'<li>{org}</li>'
                    entities_html += '</ul></div>'
                
                entities_html += '''
                        </div>
                    </div>
                </div>'''
    
        elif isinstance(entities_data, dict):
            # Keep original dictionary processing as backup
            for category, items in entities_data.items():
                entities_html += f'''
                <div class="col-md-6 col-lg-4">
                    <div class="entity-card">
                        <h5>{category}</h5>
                        <ul class="entity-list">'''
                if isinstance(items, list):
                    for item in items:
                        entities_html += f'<li>{item}</li>'
                entities_html += '''
                    </ul>
                </div>
            </div>'''

    timeline_html = ""
    if timeline_data:
        if isinstance(timeline_data, list):
            for event in timeline_data:
                if isinstance(event, dict):
                    date = event.get('date', event.get('time', 'Unknown Date'))
                    description = event.get('event', event.get('description', ''))
                    timeline_html += f'''
                    <div class="timeline-item">
                        <div class="timeline-content">
                            <span class="timeline-date">{date}</span>
                            <p class="mb-0">{description}</p>
                        </div>
                    </div>
                    '''
                else:
                    timeline_html += f'''
                    <div class="timeline-item">
                        <div class="timeline-content">
                            <p class="mb-0">{event}</p>
                        </div>
                    </div>'''
        elif isinstance(timeline_data, dict):
            for date, events in sorted(timeline_data.items()):
                if isinstance(events, list):
                    for event in events:
                        timeline_html += f'''
                        <div class="timeline-item">
                            <div class="timeline-content">
                                <span class="timeline-date">{date}</span>
                                <p class="mb-0">{event}</p>
                            </div>
                        </div>
                        '''
                else:
                    timeline_html += f'''
                    <div class="timeline-item">
                        <div class="timeline-content">
                            <span class="timeline-date">{date}</span>
                            <p class="mb-0">{events}</p>
                        </div>
                    </div>
                    '''

    sources_html = ""
    if news_data:
        if isinstance(news_data, list):
            for idx, article in enumerate(news_data, 1):
                if isinstance(article, dict):
                    title = article.get('title', f'Article {idx}')
                    url = article.get('url', article.get('link', '#'))

                    sources_html += f'''
                <div class="col-md-6">
                    <div class="card source-card custom-card">
                        <div class="card-body">
                            <h5 class="card-title">{title}</h5>
                            <a href="{url}" target="_blank" class="btn btn-outline-primary btn-sm">Read Article →</a>
                        </div>
                    </div>
                </div>'''

    html_template = f'''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>News Topic Summary - {keywords}</title>
    
    <!-- Bootstrap CSS - Local File -->
    <link rel="stylesheet" href="bootstrap.min.css">
    
    <!-- Custom Styles -->
    <style>
        :root {{
            --primary-color: #0d6efd;
            --secondary-color: #6c757d;
            --success-color: #198754;
            --info-color: #0dcaf0;
            --warning-color: #ffc107;
            --danger-color: #dc3545;
            --dark-color: #212529;
            --light-color: #f8f9fa;
        }}

        body {{
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Helvetica Neue', Arial, sans-serif;
        }}

        /* Top Bar */
        .top-bar {{
            background: var(--dark-color);
            color: white;
            padding: 10px 0;
            font-size: 0.9rem;
        }}

        .top-bar a {{
            color: white;
            text-decoration: none;
            margin-left: 15px;
        }}

        .top-bar a:hover {{
            color: var(--info-color);
        }}

        /* Custom Navbar */
        .navbar-brand {{
            font-weight: 700;
            font-size: 1.5rem;
        }}

        .navbar {{
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }}

        /* Hero Section */
        .hero-section {{
            background: linear-gradient(135deg, #0d6efd 0%, #0a58ca 100%);
            color: white;
            padding: 100px 0;
            position: relative;
            overflow: hidden;
        }}

        .hero-section::before {{
            content: '';
            position: absolute;
            top: 0;
            left: 0;
            right: 0;
            bottom: 0;
            background: url('data:image/svg+xml,<svg width="100" height="100" xmlns="http://www.w3.org/2000/svg"><defs><pattern id="grid" width="100" height="100" patternUnits="userSpaceOnUse"><path d="M 100 0 L 0 0 0 100" fill="none" stroke="rgba(255,255,255,0.1)" stroke-width="1"/></pattern></defs><rect width="100%" height="100%" fill="url(%23grid)"/></svg>');
            opacity: 0.3;
        }}

        .hero-content {{
            position: relative;
            z-index: 1;
        }}

        .hero-section h1 {{
            font-size: 3.5rem;
            font-weight: 700;
            margin-bottom: 20px;
        }}

        .hero-section p {{
            font-size: 1.3rem;
            margin-bottom: 30px;
        }}

        .stat-card {{
            background: rgba(255,255,255,0.1);
            backdrop-filter: blur(10px);
            border-radius: 10px;
            padding: 30px;
            text-align: center;
            border: 1px solid rgba(255,255,255,0.2);
        }}

        .stat-number {{
            font-size: 3rem;
            font-weight: 700;
            display: block;
            margin-bottom: 10px;
        }}

        /* Section Headers */
        .section-header {{
            text-align: center;
            margin-bottom: 50px;
        }}

        .section-header h2 {{
            font-size: 2.5rem;
            font-weight: 700;
            color: var(--dark-color);
            margin-bottom: 15px;
            position: relative;
            display: inline-block;
        }}

        .section-header h2::after {{
            content: '';
            position: absolute;
            bottom: -10px;
            left: 50%;
            transform: translateX(-50%);
            width: 60px;
            height: 4px;
            background: var(--primary-color);
            border-radius: 2px;
        }}

        .section-header p {{
            color: var(--secondary-color);
            font-size: 1.1rem;
        }}

        /* Cards */
        .custom-card {{
            transition: transform 0.3s, box-shadow 0.3s;
            height: 100%;
            border: none;
            box-shadow: 0 2px 10px rgba(0,0,0,0.08);
        }}

        .custom-card:hover {{
            transform: translateY(-5px);
            box-shadow: 0 5px 20px rgba(0,0,0,0.15);
        }}

        /* Timeline */
        .timeline {{
            position: relative;
            padding: 20px 0;
        }}

        .timeline::before {{
            content: '';
            position: absolute;
            left: 50%;
            top: 0;
            bottom: 0;
            width: 2px;
            background: var(--primary-color);
            transform: translateX(-50%);
        }}

        .timeline-item {{
            position: relative;
            margin-bottom: 50px;
            width: 45%;
        }}

        .timeline-item:nth-child(odd) {{
            margin-left: 0;
            text-align: right;
        }}

        .timeline-item:nth-child(even) {{
            margin-left: 55%;
            text-align: left;
        }}

        .timeline-item::before {{
            content: '';
            position: absolute;
            width: 20px;
            height: 20px;
            background: var(--primary-color);
            border: 4px solid white;
            border-radius: 50%;
            box-shadow: 0 0 0 4px var(--primary-color);
            top: 20px;
        }}

        .timeline-item:nth-child(odd)::before {{
            right: -60px;
        }}

        .timeline-item:nth-child(even)::before {{
            left: -60px;
        }}

        .timeline-content {{
            background: white;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
        }}

        .timeline-date {{
            display: inline-block;
            background: var(--primary-color);
            color: white;
            padding: 5px 15px;
            border-radius: 20px;
            font-size: 0.9rem;
            font-weight: 600;
            margin-bottom: 10px;
        }}

        /* Entity Cards */
        .entity-card {{
            background: var(--light-color);
            border-left: 4px solid var(--primary-color);
            padding: 25px;
            border-radius: 8px;
            margin-bottom: 20px;
            transition: all 0.3s;
        }}

        .entity-card:hover {{
            background: white;
            box-shadow: 0 3px 15px rgba(0,0,0,0.1);
        }}

        .entity-card h5 {{
            color: var(--primary-color);
            font-weight: 600;
            margin-bottom: 15px;
            padding-bottom: 10px;
            border-bottom: 2px solid #dee2e6;
        }}

        .entity-list {{
            list-style: none;
            padding: 0;
        }}

        .entity-list li {{
            padding: 8px 0;
            color: var(--dark-color);
            display: flex;
            align-items: center;
        }}

        .entity-list li::before {{
            content: '▸';
            color: var(--primary-color);
            font-weight: bold;
            margin-right: 10px;
        }}

        /* Source Cards */
        .source-card {{
            border-left: 4px solid var(--success-color);
            transition: all 0.3s;
        }}

        .source-card:hover {{
            border-left-color: var(--primary-color);
        }}

        /* Footer */
        footer {{
            background: var(--dark-color);
            color: white;
            padding: 60px 0 20px;
            margin-top: 80px;
        }}

        footer h5 {{
            color: white;
            font-weight: 600;
            margin-bottom: 20px;
        }}

        footer a {{
            color: #adb5bd;
            text-decoration: none;
            transition: color 0.3s;
        }}

        footer a:hover {{
            color: var(--primary-color);
        }}

        footer .list-unstyled li {{
            margin-bottom: 10px;
        }}

        .footer-bottom {{
            border-top: 1px solid #495057;
            padding-top: 20px;
            margin-top: 40px;
            text-align: center;
            color: #adb5bd;
        }}

        /* Responsive Timeline */
        @media (max-width: 768px) {{
            .hero-section h1 {{
                font-size: 2rem;
            }}

            .timeline::before {{
                left: 20px;
            }}

            .timeline-item {{
                width: 100%;
                margin-left: 40px !important;
                text-align: left !important;
            }}

            .timeline-item::before {{
                left: -30px !important;
            }}
        }}

        /* Animations */
        .fade-in {{
            animation: fadeIn 0.8s ease-in;
        }}

        @keyframes fadeIn {{
            from {{
                opacity: 0;
                transform: translateY(20px);
            }}
            to {{
                opacity: 1;
                transform: translateY(0);
            }}
        }}
    </style>
</head>
<body>
    <!-- Top Bar -->
    <div class="top-bar">
        <div class="container">
            <div class="row">
                <div class="col-md-6">
                    <span>contact@newsresearch.org</span>
                    <span class="ms-3">+1 (555) 123-4567</span>
                </div>
                <div class="col-md-6 text-end">
                    <a href="#">中文</a>
                    <a href="#">English</a>
                </div>
            </div>
        </div>
    </div>

    <!-- Navigation Bar -->
    <nav class="navbar navbar-expand-lg navbar-light bg-white sticky-top">
        <div class="container">
            <a class="navbar-brand" href="#">
                <span class="badge bg-primary me-2">NR</span>
                {keywords} News Summary
            </a>
            <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
                <span class="navbar-toggler-icon"></span>
            </button>
            <div class="collapse navbar-collapse" id="navbarNav">
                <ul class="navbar-nav ms-auto">
                    <li class="nav-item">
                        <a class="nav-link active" href="#summary">Research Report</a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="#entities">Key Entities</a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="#timeline">Event Timeline</a>
                    </li>
                    <li class="nav-item">
                        <a class="nav-link" href="#sources">Resources</a>
                    </li>
                </ul>
            </div>
        </div>
    </nav>

    <!-- Hero Section -->
    <section class="hero-section">
        <div class="container hero-content">
            <div class="row">
                <div class="col-lg-8 mx-auto text-center">
                    <h1 class="fade-in">News Topic Comprehensive Report</h1>
                    <p class="lead fade-in">In-depth Analysis · Authoritative Sources · Professional Insights</p>
                </div>
            </div>
        </div>
    </section>

    <!-- Breadcrumb Navigation -->
    <nav aria-label="breadcrumb" class="bg-light py-3">
        <div class="container">
            <ol class="breadcrumb mb-0">
                <li class="breadcrumb-item"><a href="#">Home</a></li>
                <li class="breadcrumb-item"><a href="#">Research Reports</a></li>
                <li class="breadcrumb-item active">News Topic Report</li>
            </ol>
        </div>
    </nav>

    <!-- Main Summary Section -->
    <section id="summary" class="py-5">
        <div class="container">
            <div class="section-header">
                <h2>Research Report Summary</h2>
                <p>In-depth analysis of breakthrough achievements in the news topic</p>
            </div>
            
            <div class="row g-4">
                <div class="col-12">
                    <div class="card custom-card">
                        <div class="card-body">
                            <div style="white-space: pre-wrap; line-height: 1.8;">{summary_text if summary_text else 'No summary content available'}</div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </section>

    <!-- Key Entities Section -->
    <section id="entities" class="py-5 bg-light">
        <div class="container">
            <div class="section-header">
                <h2>Key Entities Directory</h2>
                <p>Important people and organizations related to the news topic</p>
            </div>
            
            <div class="row g-4">
                {entities_html if entities_html else '<div class="col-12"><p class="text-muted text-center">No entity data available</p></div>'}
            </div>
        </div>
    </section>

    <!-- Timeline Section -->
    <section id="timeline" class="py-5">
        <div class="container">
            <div class="section-header">
                <h2>Event Timeline</h2>
                <p>Chronology of major events</p>
            </div>
            
            <div class="timeline">
                {timeline_html if timeline_html else '<p class="text-muted text-center">No timeline data available</p>'}
            </div>
        </div>
    </section>

    <!-- Resources Section -->
    <section id="sources" class="py-5 bg-light">
        <div class="container">
            <div class="section-header">
                <h2>Resource Center</h2>
                <p>Authoritative news sources and reference materials</p>
            </div>
            
            <div class="row g-4">
                {sources_html if sources_html else '<div class="col-12"><p class="text-muted text-center">No news source data available</p></div>'}
            </div>
        </div>
    </section>

    <!-- Footer -->
    <footer>
        <div class="container">
            <div class="row">
                <div class="col-lg-3 col-md-6 mb-4">
                    <h5>About Us</h5>
                    <p>News Research Center is committed to providing authoritative and professional news research reports and in-depth analysis.</p>
                </div>
                <div class="col-lg-3 col-md-6 mb-4">
                    <h5>Quick Links</h5>
                    <ul class="list-unstyled">
                        <li><a href="#summary">Research Report</a></li>
                        <li><a href="#entities">Key Entities</a></li>
                        <li><a href="#timeline">Event Timeline</a></li>
                        <li><a href="#sources">Resources</a></li>
                    </ul>
                </div>
                <div class="col-lg-3 col-md-6 mb-4">
                    <h5>Contact Information</h5>
                    <ul class="list-unstyled">
                        <li>contact@newsresearch.org</li>
                        <li>+1 (555) 123-4567</li>
                        <li>New York, USA</li>
                    </ul>
                </div>
                <div class="col-lg-3 col-md-6 mb-4">
                    <h5>Data Sources</h5>
                    <p>The data in this report comes from multiple authoritative news agencies and is collected, cleaned, and analyzed through automated data pipelines.</p>
                </div>
            </div>
            <div class="footer-bottom">
                <p>&copy; 2025 News Research Center. All rights reserved. | This page was automatically generated by a data engineering pipeline | Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
            </div>
        </div>
    </footer>

    <!-- Bootstrap JS - Local File -->
    <script src="bootstrap.bundle.min.js"></script>
    
    <!-- Custom Scripts -->
    <script>
        // Smooth scrolling
        document.querySelectorAll('a[href^="#"]').forEach(anchor => {{
            anchor.addEventListener('click', function (e) {{
                e.preventDefault();
                const target = document.querySelector(this.getAttribute('href'));
                if (target) {{
                    const headerOffset = 70;
                    const elementPosition = target.getBoundingClientRect().top;
                    const offsetPosition = elementPosition + window.pageYOffset - headerOffset;

                    window.scrollTo({{
                        top: offsetPosition,
                        behavior: 'smooth'
                    }});
                }}
            }});
        }});

        // Navbar active state
        window.addEventListener('scroll', () => {{
            let current = '';
            const sections = document.querySelectorAll('section[id]');
            
            sections.forEach(section => {{
                const sectionTop = section.offsetTop;
                if (pageYOffset >= sectionTop - 100) {{
                    current = section.getAttribute('id');
                }}
            }});

            document.querySelectorAll('.navbar-nav .nav-link').forEach(link => {{
                link.classList.remove('active');
                if (link.getAttribute('href') === `#${{current}}`) {{
                    link.classList.add('active');
                }}
            }});
        }});

        // Scroll animations
        const observerOptions = {{
            threshold: 0.1,
            rootMargin: '0px 0px -50px 0px'
        }};

        const observer = new IntersectionObserver((entries) => {{
            entries.forEach(entry => {{
                if (entry.isIntersecting) {{
                    entry.target.style.opacity = '1';
                    entry.target.style.transform = 'translateY(0)';
                }}
            }});
        }}, observerOptions);

        document.querySelectorAll('.custom-card, .entity-card, .timeline-item').forEach(el => {{
            el.style.opacity = '0';
            el.style.transform = 'translateY(20px)';
            el.style.transition = 'opacity 0.6s ease, transform 0.6s ease';
            observer.observe(el);
        }});
    </script>
</body>
</html>'''

    try:
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(html_template)
        print(f"HTML page successfully generated: {output_file}")
        return True
    except Exception as e:
        print(f"Error generating HTML file: {e}")
        return False

def generate_report(keywords):
    print("Starting HTML summary page generation...")
    print("Loading data files...")
    
    # Load the pipeline outputs (paths are relative to the current working directory)
    summary_text = load_text_file('final_summary.txt')
    entities_data = load_json_file('entities.json')
    timeline_data = load_json_file('timeline.json')
    news_data = load_json_file('cleaned_news.json')

    success = generate_html_page(
        keywords,
        summary_text=summary_text,
        entities_data=entities_data,
        timeline_data=timeline_data,
        news_data=news_data,
        output_file='index.html'
    )

    if success:
        print("All operations completed!")
        print("Tip: Open index.html in your browser to view the results")
        print("Local Bootstrap files:")
        print("  - bootstrap.min.css")
        print("  - bootstrap.bundle.min.js")
    else:
        print("Issues encountered during generation, please check error messages")

## 7. Run pipeline
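The run cells take a keyword plus a start and end date given as `YYYY-MM-DD` strings. Whether the crawler validates those strings is not shown here, so the following is only a sketch of a check that could run before crawling. `parse_window` is a hypothetical helper and is not part of the pipeline.

```python
from datetime import datetime


def parse_window(start: str, end: str):
    """Parse two YYYY-MM-DD strings and check that start is not after end.

    Hypothetical helper for illustration only.
    """
    start_d = datetime.strptime(start, "%Y-%m-%d").date()  # raises ValueError on a bad format
    end_d = datetime.strptime(end, "%Y-%m-%d").date()
    if start_d > end_d:
        raise ValueError(f"start date {start} is after end date {end}")
    return start_d, end_d


start_d, end_d = parse_window("2025-09-23", "2025-10-22")
print(f"Window covers {(end_d - start_d).days + 1} calendar days")
```

Failing early on a malformed or reversed window avoids running the crawl and the model calls on an empty dataset.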

In [None]:
# Event keyword to search for
keyword = "Nobel Prize"  # e.g. the 2025 Nobel Prize
# Start of the date window (YYYY-MM-DD)
start_time = "2025-09-23"
# End of the date window (YYYY-MM-DD)
end_time = "2025-10-22"

In [None]:
# Crawl the news
crawl_the_news(keyword, start_time, end_time)
# Clean the data
cleaned_the_data()

# Extract information
with open("cleaned_news.json", "r", encoding="utf-8") as f:
    data = json.load(f)
# Extract timeline
timeline = extract_timeline(data, keyword)
# Save results (the with-block closes the file handle)
with open("timeline.json", "w", encoding="utf-8") as f:
    json.dump(timeline, f, ensure_ascii=False, indent=2)
# Extract entities
entities = extract_entities(data, keyword)
# Save results
with open("entities.json", "w", encoding="utf-8") as f:
    json.dump(entities, f, ensure_ascii=False, indent=2)

# Summarize
summary(keyword)

# Generate HTML page
generate_report(keyword)