
# 🔗 Open Library: Trending Books via Search API (REST)
**Goal:** fetch **600** books using the Open Library Search API with this query:

```
trending_score_hourly_sum:[1 TO *] -subject:"content_warning:cover" language:eng -subject:"content_warning:cover" -subject:"content_warning:cover"
```
Sorted by **trending**.

**Fields we want (if available):**
- Rank (position)
- Title
- Author
- Average rating
- Number of ratings
- Year (first publish year)
- Trending score (hourly sum)
- Book URL (work page)

**API docs (for reference):**
- Search API endpoint: `https://openlibrary.org/search.json` (supports `q`, `page`, `limit`, `sort`, `fields`).  
- `fields` lets us request specific fields like `ratings_average`, `ratings_count`, `first_publish_year`, etc.  
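
To make the parameter list concrete, here is a minimal sketch of how these parameters end up in a request URL. The helper name `build_search_url` and the short example query are illustrative assumptions, not part of the Open Library documentation; only the endpoint and the parameter names come from the notes above.

```python
from urllib.parse import urlencode

BASE_URL = "https://openlibrary.org/search.json"

def build_search_url(query, page=1, limit=100, sort="trending", fields=()):
    """Return the full Search API URL for one page of results (hypothetical helper)."""
    params = {"q": query, "page": page, "limit": limit, "sort": sort}
    if fields:
        # `fields` is sent as one comma-separated string, not as repeated keys.
        params["fields"] = ",".join(fields)
    return f"{BASE_URL}?{urlencode(params)}"

url = build_search_url("language:eng", page=2, limit=50, fields=["key", "title"])
print(url)
```

`requests.get(BASE, params=...)` performs the same encoding internally, so a helper like this is mostly useful for logging or for pasting a URL into a browser to inspect the raw JSON.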


In [7]:
# 📦 Imports
import requests
import pandas as pd
import time
import random
from bs4 import BeautifulSoup 

# 🔧 Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

# 🌐 Endpoint
BASE = "https://openlibrary.org/search.json"

# 📝 Query (given)
QUERY = 'trending_score_hourly_sum:[1 TO *] -subject:"content_warning:cover" language:eng -subject:"content_warning:cover" -subject:"content_warning:cover"'

# 🔢 Pagination
LIMIT = 100  # page size
TARGET = 600  # how many books we want
MAX_PAGES = 20  # safety cap (in case docs per page < LIMIT)

# 🎯 Sort & Fields
# We request specific fields to ensure ratings/trending/year/author are returned.
FIELDS = [
    "key",
    "title",
    "author_name",
    "author_key",
    "first_publish_year",
    "ratings_average",
    "ratings_count",
    "trending_score_hourly_sum"
]
FIELDS_PARAM = ",".join(FIELDS)

# 🌐 Headers (polite: identify the client instead of mimicking a browser;
# the app name and contact address below are placeholders to replace with your own)
headers = {
    "User-Agent": "openlibrary-trending-notebook/0.1 (contact: you@example.com)"
}


## 1) Test request (page 1): sanity check
We call the API with our `q`, `sort=trending`, `page=1`, `limit=100`, and the `fields` list.


In [8]:
page = 1
params = {
    "q": QUERY,
    "sort": "trending",
    "page": page,
    "limit": LIMIT,
    "fields": FIELDS_PARAM
}
print("Requesting page", page)
r = requests.get(BASE, params=params, headers=headers, timeout=30)
print("Status code:", r.status_code)
r.raise_for_status()  # stop on a 4xx/5xx instead of failing later in r.json()

data = r.json()
type(data), list(data.keys())[:5], data.get("numFound", None)

Requesting page 1
Status code: 200


(dict, ['numFound', 'start', 'numFoundExact', 'num_found', 'q'], 269710)

In [9]:
# Peek at the first 2 docs to see field availability
docs = data.get("docs", [])
print("Docs on page 1:", len(docs))
docs[:2]

Docs on page 1: 100


[{'author_key': ['OL543322A'],
  'author_name': ['Anthony Bourdain'],
  'first_publish_year': 2000,
  'key': '/works/OL3348011W',
  'title': 'Kitchen Confidential',
  'ratings_average': 4.0408163,
  'ratings_count': 49,
  'trending_score_hourly_sum': 132},
 {'author_key': ['OL76437A'],
  'author_name': ['Hermann Hesse'],
  'first_publish_year': 1922,
  'key': '/works/OL872932W',
  'title': 'Siddhartha',
  'ratings_average': 4.090909,
  'ratings_count': 55,
  'trending_score_hourly_sum': 254}]


## 2) Loop pages until 600 books
We collect docs across pages.
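
The pagination logic can also be sketched separately from the HTTP call, which makes it easy to check without hitting the API. This is a minimal sketch under assumptions: `iter_docs`, `collect`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a function that would request one page and return its `docs` list.

```python
from itertools import islice

def iter_docs(fetch_page, max_pages=20):
    """Yield docs page by page until a page comes back empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        docs = fetch_page(page)
        if not docs:
            return
        yield from docs

def collect(fetch_page, target, max_pages=20):
    """Return up to `target` (rank, doc) pairs; rank keeps counting across pages."""
    docs = islice(iter_docs(fetch_page, max_pages), target)
    return list(enumerate(docs, start=1))

# Stand-in for the real HTTP call: two pages with three docs each.
def fake_fetch(page):
    pages = {
        1: [{"title": "A"}, {"title": "B"}, {"title": "C"}],
        2: [{"title": "D"}, {"title": "E"}, {"title": "F"}],
    }
    return pages.get(page, [])

ranked = collect(fake_fetch, target=4)
print([(rank, doc["title"]) for rank, doc in ranked])
```

Because `islice` stops pulling from the generator once `target` docs are consumed, no further pages are requested after the target is reached.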

In [10]:
# Empty containers
ranks = []
titles = []
authors = []
author_urls = []
avg_ratings = []
num_ratings = []
years = []
trend_scores = []
book_urls = []
genres_list = []

total_collected = 0
global_rank = 0

for page in range(1, MAX_PAGES + 1):
    if total_collected >= TARGET:
        break

    params = {
        "q": QUERY,
        "sort": "trending",
        "page": page,
        "limit": LIMIT,
        "fields": FIELDS_PARAM
    }

    print(f"\nFetching page {page} ...")
    r = requests.get(BASE, params=params, headers=headers, timeout=30)
    print("Status:", r.status_code)
    r.raise_for_status()  # abort on a 4xx/5xx rather than parsing an error body
    data = r.json()
    docs = data.get("docs", [])
    print("Docs returned:", len(docs))

    if not docs:
        print("No more results.")
        break

    # Assign a running rank and extract fields
    for d in docs:
        global_rank += 1

        title = d.get("title") or ""
        
        # Author (name)
        author_list = d.get("author_name") or []
        author = author_list[0] if author_list else ""
        
        # Author (URL)
        akeys = d.get("author_key") or []
        akey = akeys[0] if akeys else ""
        author_url = f"https://openlibrary.org/authors/{akey}" if akey else ""

        avg = d.get("ratings_average", None)
        cnt = d.get("ratings_count", None)

        year = d.get("first_publish_year", None)

        tscore = d.get("trending_score_hourly_sum", None)

        k = d.get("key", "")
        url = ""
        if isinstance(k, str):
            if k.startswith("/"):
                url = "https://openlibrary.org" + k
            else:
                url = "https://openlibrary.org/works/" + k
                
        # --- Genres from Open Library "Subjects" section ---
        book_genres = []
        if url:
            try:
                detail_res = requests.get(url, headers=headers, timeout=30)
                detail_soup = BeautifulSoup(detail_res.text, "html.parser")

                subjects_block = None

                # Find the <h3>Subjects</h3> section
                for h3 in detail_soup.find_all("h3"):
                    if h3.get_text(strip=True).lower() == "subjects":
                        # Usually the <a> tags with subjects are inside the same parent block
                        subjects_block = h3.parent
                        break

                if subjects_block:
                    subject_links = subjects_block.find_all("a", href=True)
                    for link in subject_links:
                        subj_text = link.get_text(strip=True)
                        if subj_text and subj_text not in book_genres:
                            book_genres.append(subj_text)

            except requests.RequestException:
                # If the request fails, keep book_genres empty
                book_genres = []
            finally:
                # Small delay to be polite, even after a failed request
                time.sleep(0.2)
                
        # Store genres as comma-separated string (same idea as Goodreads)
        if book_genres:
            genres_list.append(", ".join(book_genres))
        else:
            genres_list.append(None)

        ranks.append(global_rank)
        titles.append(title)
        authors.append(author)
        author_urls.append(author_url)
        avg_ratings.append(avg)
        num_ratings.append(cnt)
        years.append(year)
        trend_scores.append(tscore)
        book_urls.append(url)

        total_collected += 1
        if total_collected >= TARGET:
            break

    time.sleep(random.uniform(0.8, 1.6))

print("\nTotal collected:", total_collected)


Fetching page 1 ...
Status: 200
Docs returned: 100

Fetching page 2 ...
Status: 200
Docs returned: 100

Fetching page 3 ...
Status: 200
Docs returned: 100

Fetching page 4 ...
Status: 200
Docs returned: 100

Fetching page 5 ...
Status: 200
Docs returned: 100

Fetching page 6 ...
Status: 200
Docs returned: 100

Total collected: 600
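
The loop above makes one extra HTML request per book to read its subjects. A possible alternative, assuming the Search API also returns a `subject` list when `"subject"` is added to `FIELDS` (worth confirming on a test request first), is to build the genres string straight from each doc. The helper name `genres_from_doc` and the sample doc below are illustrative only.

```python
def genres_from_doc(doc, max_subjects=None):
    """Join a doc's `subject` list into a comma-separated string, or None if empty."""
    seen = []
    for subject in doc.get("subject") or []:
        text = subject.strip()
        # Skip blanks and duplicates while keeping the original order.
        if text and text not in seen:
            seen.append(text)
    if max_subjects is not None:
        seen = seen[:max_subjects]
    return ", ".join(seen) if seen else None

# Hypothetical doc shape; real subject lists are usually much longer.
sample = {"title": "Siddhartha", "subject": ["Fiction", "Buddhism", "Fiction", " "]}
print(genres_from_doc(sample))
```

If the field is available, this would avoid the per-book page fetches entirely; the subjects on the work page and in the search index may not match exactly, so comparing a few books is a sensible first check.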



## 3) Build the final DataFrame & clean types


In [11]:
import numpy as np

df = pd.DataFrame({
    "rank_in_page": ranks,
    "title": titles,
    "author": authors,
    "author_url": author_urls,
    "avg_rating": avg_ratings,
    "num_ratings": num_ratings,
    "year": years,
    "score": trend_scores,
    "book_url": book_urls,
    "genres": genres_list
})

df["avg_rating"] = pd.to_numeric(df["avg_rating"], errors="coerce").round(2)
df["num_ratings"] = pd.to_numeric(df["num_ratings"], errors="coerce", downcast="integer")
df["year"] = pd.to_numeric(df["year"], errors="coerce", downcast="integer")
df["score"] = pd.to_numeric(df["score"], errors="coerce")

df.head(12)

| | rank_in_page | title | author | author_url | avg_rating | num_ratings | year | score | book_url | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Kitchen Confidential | Anthony Bourdain | https://openlibrary.org/authors/OL543322A | 4.04 | 49.0 | 2000.0 | 132 | https://openlibrary.org/works/OL3348011W | Cooks, Cocineros, History, New York Times best... |
| 1 | 2 | Siddhartha | Hermann Hesse | https://openlibrary.org/authors/OL76437A | 4.09 | 55.0 | 1922.0 | 254 | https://openlibrary.org/works/OL872932W | Alegorías, Buddha (The concept), Buddha and Bu... |
| 2 | 3 | Silence | Shūsaku Endō | https://openlibrary.org/authors/OL4282449A | 4.17 | 6.0 | 1980.0 | 91 | https://openlibrary.org/works/OL15391655W | Fiction, Christians, History, Missionaries in ... |
| 3 | 4 | Phantastes | George MacDonald | https://openlibrary.org/authors/OL23082A | 3.82 | 11.0 | 1850.0 | 207 | https://openlibrary.org/works/OL15450W | Fairy tales, Scottish Fantasy fiction, Fiction... |
| 4 | 5 | The First Man in Rome | Colleen McCullough | https://openlibrary.org/authors/OL225331A | 4.44 | 9.0 | 1990.0 | 63 | https://openlibrary.org/works/OL1882554W | Fiction, historical, Rome, fiction, Fiction, H... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7 | 8 | Almost a Stranger | Margaret Way | https://openlibrary.org/authors/OL1175057A | 3.20 | 10.0 | 1984.0 | 73 | https://openlibrary.org/works/OL5236086W | Fiction, Romance, Contemporary |
| 8 | 9 | John Donne Poetry | John Donne | https://openlibrary.org/authors/OL123428A | 4.80 | 5.0 | 1633.0 | 42 | https://openlibrary.org/works/OL15420303W | Criticism and interpretation, English Christia... |
| 9 | 10 | Die Verwandlung | Franz Kafka | https://openlibrary.org/authors/OL33146A | 4.10 | 129.0 | 1915.0 | 176 | https://openlibrary.org/works/OL498556W | Fantasy fiction, Children's fiction, Lectures ... |
| 10 | 11 | The Invisible Man | H. G. Wells | https://openlibrary.org/authors/OL13066A | 3.81 | 97.0 | 0.0 | 145 | https://openlibrary.org/works/OL52266W | Ciencia-ficción, Classic Literature, Fiction, ... |
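
Two details in this output are worth a follow-up: `num_ratings` and `year` display as floats (`49.0`, `2000.0`), which suggests that at least one missing value kept `downcast="integer"` from producing an integer dtype, and one row has `year` equal to `0.0`, which looks like a placeholder rather than a real publication year. A minimal sketch of one way to handle both, using pandas' nullable `Int64` dtype on a small made-up frame (treating non-positive years as missing is an assumption, not something the API documents):

```python
import pandas as pd

# Toy data mirroring the two problem cases: a missing count and a year of 0.
raw = pd.DataFrame({
    "num_ratings": [49, None, 97],
    "year": [2000, 1922, 0],
})

clean = raw.copy()
# Nullable Int64 keeps whole numbers as integers and represents gaps as <NA>.
clean["num_ratings"] = pd.to_numeric(clean["num_ratings"], errors="coerce").astype("Int64")

# Assumption: a year of 0 or below means "unknown", so mask it before casting.
year = pd.to_numeric(clean["year"], errors="coerce")
clean["year"] = year.mask(year <= 0).astype("Int64")

print(clean)
print(clean.dtypes)
```

With this conversion the CSV would contain `49` and `2000` rather than `49.0` and `2000.0`, and empty cells where values are missing.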



## 4) Save to CSV


In [12]:
out_csv = r"..\data\openlibrary_trending_600.csv"  # raw string: "\d" and "\o" are not valid escape sequences
df.to_csv(out_csv, index=False, encoding="utf-8")
print("Saved:", out_csv, "- rows:", len(df))

Saved: ..\data\openlibrary_trending_600.csv - rows: 600



### ✅ Notes
- We used `page` + `limit` for pagination. `page` starts at 1. We used `limit=100` for fewer requests.
- We passed `fields=...` to **explicitly** request ratings + year + trending fields. Numeric fields missing from a doc are stored as `None` and show up as `NaN` after the numeric conversion.
- `rank_in_page` is, despite its name, our running index across all pages (the API returns results already sorted by `trending`).
- To enlarge or reduce, change `TARGET` or `LIMIT`.
- The `language:eng` filter in the `QUERY` string appears to match works that have an English edition, so a listed title can still be in another language (for example, *Die Verwandlung* in the output above).

### ℹ️ Field mapping to the table requirements
- **Rank** → `rank_in_page` (running counter across pages)
- **Book Title** → `title`
- **Author** → first value of `author_name`
- **Author URL** → built from the first value of `author_key`
- **Average rating** → `ratings_average`
- **Number of ratings** → `ratings_count`
- **Year** → `first_publish_year`
- **Score** → `trending_score_hourly_sum` (activity score; 24h aggregate)
- **Book URL** → built from `key` (work URL)
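
The mapping above can be expressed as one small function that turns a Search API doc into a row, instead of appending to ten parallel lists. This is a sketch, not the notebook's actual code: `doc_to_row` and `BASE_SITE` are hypothetical names, the output keys follow the DataFrame columns used earlier, and the sample doc is the first one printed in the sanity check.

```python
BASE_SITE = "https://openlibrary.org"

def doc_to_row(doc, rank):
    """Map one Search API doc to a flat row, following the field mapping above."""
    author_names = doc.get("author_name") or []
    author_keys = doc.get("author_key") or []
    key = doc.get("key") or ""
    # Keys normally look like "/works/OL...W"; add the prefix if it is missing.
    if key and not key.startswith("/"):
        key = "/works/" + key
    return {
        "rank_in_page": rank,
        "title": doc.get("title") or "",
        "author": author_names[0] if author_names else "",
        "author_url": f"{BASE_SITE}/authors/{author_keys[0]}" if author_keys else "",
        "avg_rating": doc.get("ratings_average"),
        "num_ratings": doc.get("ratings_count"),
        "year": doc.get("first_publish_year"),
        "score": doc.get("trending_score_hourly_sum"),
        "book_url": BASE_SITE + key if key else "",
    }

doc = {
    "author_key": ["OL543322A"],
    "author_name": ["Anthony Bourdain"],
    "first_publish_year": 2000,
    "key": "/works/OL3348011W",
    "title": "Kitchen Confidential",
    "ratings_average": 4.0408163,
    "ratings_count": 49,
    "trending_score_hourly_sum": 132,
}
row = doc_to_row(doc, rank=1)
print(row["author_url"], row["book_url"])
```

A list of such rows can be passed directly to `pd.DataFrame(rows)`, which keeps each book's values together and avoids lists drifting out of sync.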

### Next steps
- Join with the Goodreads CSV to create a single dataset.
- Add more fields via `fields=` (e.g., `edition_count`, `readinglog_count`, `cover_i`).
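
For the Goodreads join, one possible approach is to match on a normalized title and author key. Everything about the Goodreads side here is an assumption: the column names (`title`, `author`, `gr_avg_rating`) and the placeholder rating value are made up for illustration, and `add_join_key` is a hypothetical helper.

```python
import pandas as pd

def add_join_key(df, title_col="title", author_col="author"):
    """Return a copy with a lowercase, whitespace-trimmed 'title|author' key."""
    out = df.copy()
    out["join_key"] = (
        out[title_col].str.strip().str.lower()
        + "|"
        + out[author_col].str.strip().str.lower()
    )
    return out

# Two Open Library rows taken from the output above.
ol = pd.DataFrame({
    "title": ["Siddhartha", "Silence"],
    "author": ["Hermann Hesse", "Shūsaku Endō"],
    "score": [254, 91],
})
# Hypothetical Goodreads frame; the rating is a placeholder, not real data.
gr = pd.DataFrame({
    "title": ["siddhartha "],
    "author": ["Hermann Hesse"],
    "gr_avg_rating": [4.1],
})

merged = add_join_key(ol).merge(
    add_join_key(gr)[["join_key", "gr_avg_rating"]],
    on="join_key",
    how="left",
    indicator=True,  # adds a _merge column showing which rows found a match
)
print(merged[["title", "score", "gr_avg_rating", "_merge"]])
```

Exact-key matching will miss books whose titles differ between the two sites (subtitles, translations, transliterated author names), so the `_merge` column is a quick way to see how many rows stayed unmatched before deciding whether fuzzier matching is needed.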
