This notebook crawls the publication list at https://www.hkiaps.cuhk.edu.hk/hkiaps-publications-publication-list/ and saves the results to a CSV file.


In [None]:
import csv
import time
from pathlib import Path
import httpx
from bs4 import BeautifulSoup


def scrape_publications(total_pages: int = 42) -> list[dict[str, str]]:
    """Scrape publication details including all metadata from all pages.

    Args:
        total_pages (int): Number of listing pages to request, starting from page 1

    Returns:
        list[dict[str, str]]: One dictionary per publication; fields not found on the page are omitted

    """
    publications = []
    base_url = "https://www.hkiaps.cuhk.edu.hk/hkiaps-publications-publication-list/"

    for page in range(1, total_pages + 1):
        url = f"{base_url}?current_page={page}&filterCat[]=all&filterCat[]=asia-pacific-in-the-21st-century-book-series&filterCat[]=occasional-paper-series&filterCat[]=policy-research-report-series&filterCat[]=public-policy-forum-series&filterCat[]=research-monograph-series&filterCat[]=universities-service-centre-seminar-series&filterYear=all"
        print(f"Scraping page {page}/{total_pages}...")

        try:
            response = httpx.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
            response.raise_for_status()
            soup = BeautifulSoup(response.content, "html.parser")
            containers = soup.find_all("div", class_="bg-fafafa")

            for container in containers:
                pub = {}
                # The first two Headline3 divs hold the publication ID and the title.
                headlines = container.find_all("div", class_="Headline3")
                if len(headlines) >= 2:
                    pub["id"] = headlines[0].get_text(strip=True)
                    pub["title"] = headlines[1].get_text(strip=True)
                # Classify each link by its label: abstract page, table of contents, or full PDF.
                buttons = container.find_all("a")
                for button in buttons:
                    href = button.get("href", "")
                    text = button.get_text(strip=True)
                    if "Abstract" in text:
                        # Abstract links may be site-relative, so prefix the host when needed.
                        pub["abstract_url"] = (
                            href
                            if href.startswith("http")
                            else f"https://www.hkiaps.cuhk.edu.hk{href}"
                        )
                    elif "Table of Contents" in text and href.endswith(".pdf"):
                        pub["toc_pdf"] = href
                    elif "PDF" in text and href.endswith(".pdf"):
                        pub["pdf_url"] = href
                # Free-text metadata block; text nodes are joined with newlines.
                info_div = container.find("div", class_="Body2")
                if info_div:
                    pub["metadata"] = info_div.get_text(separator="\n", strip=True)
                # Cover image, if the entry has one.
                img = container.find("img", class_="publicationsImg")
                if img:
                    pub["image_url"] = img.get("src", "")
                publications.append(pub)
                print(f"  - {pub.get('id', 'N/A')}: {pub.get('title', 'N/A')[:50]}...")

            print(f"Found {len(containers)} publications on page {page}")
            # Pause between successful requests to keep the load on the server low.
            time.sleep(5)

        except Exception as e:
            print(f"Error on page {page}: {e}")

    return publications


# Main execution
if __name__ == "__main__":
    print("Starting scraper...\n")
    publications = scrape_publications(total_pages=42)  # Adjust total_pages as needed

Starting scraper...

Scraping page 1/42...
  - RM118: 潮汕人與一帶一路：金融教育領軍人...
  - PRR09: 做強香港文化產業：軟實力的視角...
  - PRR08: 做強香港家族辦公室業務：歷史、社會與文化的研究...
  - RM117: 《潮領香江》...
  - RM116: Building a Sustainable Healthcare System for Hong ...
  - PRR07: Building a Sustainable Healthcare System for Hong ...
  - RM115: 《潮汕人與一帶一路：商業貿易的開拓》...
  - PRR06: 全球潮人與一帶一路學術政策國際論壇：論壇背景、紀要與政策建議...
  - RM114: 《放寬香港視野：美國國家安全法研究》...
  - PRR05: 香港青年社會流動研究：住戶統計調查的分析...
Found 10 publications on page 1
Scraping page 2/42...
  - RM113: 《2020中國效應：台港民眾的態度變遷》...
  - PRR04: 抗疫路上：香港市民眼中的新冠疫情衝擊與應變研究...
  - RM112: 《勾勒與比較台港社會意索》...
  - PRR03: 跨境就學就業的性別歧視及性騷擾問題...
  - OP244(T): 被害者視角：刑法何以保護人工智能體？[繁體中文]...
  - OP244(S): 被害者视角：刑法何以保护人工智能体？[简体中文]...
  - PRR02: 顧己及人：推動正確行為以改善公廁衞生...
  - OP243(T): 可否消滅貧窮？近二百年來貧窮面貌的變化[繁體中文]...
  - OP243(S): 可否消灭贫穷？近二百年来贫穷面貌的变化[简体中文]...
  - RM111: 《香港與台灣的社會政治新動向》...
Found 10 publications on page 2
Scraping page 3/42...
  - OP242(T): 商談理性與刑事庭審實質化改革研究：基於刑法與刑事訴訟法交叉的視角[繁體中文]...
  - OP242(S): 商谈理性与刑事庭审实质化改革研究：

FileNotFoundError: [Errno 2] No such file or directory: 'hkiaps-publications/hkiaps_publications.csv'
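
The scraper above builds its listing URL as one long f-string and prefixes the host onto relative abstract links by hand. The cell below is a minimal sketch of how the same two steps could be written with the standard library's `urllib.parse`. The names `CATEGORIES`, `build_page_url`, and `absolute_url` are hypothetical and not part of the original notebook; the category slugs are copied from the query string used in `scrape_publications`.

```python
from urllib.parse import urlencode, urljoin

SITE_ROOT = "https://www.hkiaps.cuhk.edu.hk/"
BASE_URL = "https://www.hkiaps.cuhk.edu.hk/hkiaps-publications-publication-list/"

# Category slugs copied from the query string in scrape_publications.
CATEGORIES = [
    "all",
    "asia-pacific-in-the-21st-century-book-series",
    "occasional-paper-series",
    "policy-research-report-series",
    "public-policy-forum-series",
    "research-monograph-series",
    "universities-service-centre-seminar-series",
]


def build_page_url(page: int) -> str:
    """Hypothetical helper: assemble the listing URL for one page."""
    params = [("current_page", page)]
    params += [("filterCat[]", category) for category in CATEGORIES]
    params.append(("filterYear", "all"))
    # safe="[]" leaves the brackets unescaped so the result matches the original f-string.
    return f"{BASE_URL}?{urlencode(params, safe='[]')}"


def absolute_url(href: str) -> str:
    """Hypothetical helper: resolve a possibly relative href against the site root."""
    # urljoin returns href unchanged when it is already an absolute URL.
    return urljoin(SITE_ROOT, href)
```

With these helpers, `build_page_url(page)` could stand in for the f-string inside the loop, and `absolute_url(href)` could be applied to every link type. Whether the PDF and table-of-contents links on the site are ever relative is not something the notebook output shows, so treat that part as an assumption.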

In [4]:
def save_to_csv(
    publications: list[dict[str, str]], dir_name: str, filename: str = "hkiaps_publications.csv"
) -> None:
    """Save publications to CSV file.

    Args:
        publications (list[dict[str, str]]): List of publication dictionaries
        dir_name (str): Output directory; created if it does not exist
        filename (str): Output CSV filename

    Returns:
        None: Writes data to CSV file

    """
    if not publications:
        print("No publications to save")
        return
    fieldnames = [
        "id",
        "title",
        "metadata",
        "pdf_url",
        "toc_pdf",
        "abstract_url",
        "image_url",
    ]
    filepath = Path(dir_name) / filename
    filepath.parent.mkdir(parents=True, exist_ok=True)
    with filepath.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(publications)
    print(f"\nSaved {len(publications)} publications to {filename}")


save_to_csv(publications, dir_name="hkiaps-publications")
print("\n=== Summary ===")
print(f"Total publications: {len(publications)}")


Saved 414 publications to hkiaps_publications.csv

=== Summary ===
Total publications: 414
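
As a quick sanity check on the saved file, the CSV can be read back and inspected for empty fields. The cell below is a sketch only: `load_publications` and `count_missing` are hypothetical helpers, not part of the original notebook. It relies on the fact that `csv.DictWriter` writes an empty string for any key missing from a row, so missing fields come back as empty strings.

```python
import csv
from pathlib import Path


def load_publications(csv_path: str) -> list[dict[str, str]]:
    """Hypothetical helper: read the saved CSV back into a list of row dictionaries."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_missing(rows: list[dict[str, str]], field: str) -> int:
    """Count rows whose value for `field` is empty or absent."""
    return sum(1 for row in rows if not row.get(field))
```

For example, `rows = load_publications("hkiaps-publications/hkiaps_publications.csv")` followed by `count_missing(rows, "pdf_url")` would report how many entries were saved without a PDF link.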
