
# 🔗 Open Library: Trending Books via Search API (REST)
**Goal:** fetch **~500** books using the Open Library Search API with this query:

```
trending_score_hourly_sum:[1 TO *] -subject:"content_warning:cover" language:eng -subject:"content_warning:cover" -subject:"content_warning:cover"
```
Sorted by **trending**.

**Fields we want (if available):**
- Rank (position)
- Title
- Author
- Average rating
- Number of ratings
- Year (first publish year)
- Trending score (hourly sum)
- Book URL (work page)

**API docs (for reference):**
- Search API endpoint: `https://openlibrary.org/search.json` (supports `q`, `page`, `limit`, `sort`, `fields`).  
- `fields` lets us request specific fields like `ratings_average`, `ratings_count`, `first_publish_year`, etc.  
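
To see how these parameters combine into a single request, here is a minimal sketch that builds the URL with the standard library only. The helper name `build_search_url`, the shortened query, and the three-field list are illustrative assumptions, not the values used in the rest of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://openlibrary.org/search.json"

def build_search_url(query, page=1, limit=100, sort="trending", fields=()):
    """Return the full Search API URL for one page of results."""
    params = {"q": query, "page": page, "limit": limit, "sort": sort}
    if fields:
        # The fields parameter is sent as one comma-separated string
        params["fields"] = ",".join(fields)
    return f"{BASE_URL}?{urlencode(params)}"

# Illustrative call with a shortened query (assumption, not the full QUERY)
url = build_search_url(
    "language:eng",
    fields=["key", "title", "ratings_average"],
)
print(url)
```

Printing the URL before sending anything makes it easy to paste into a browser and inspect the raw JSON.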


In [15]:
# 📦 Imports
import requests
import pandas as pd
import time
import random

# 🔧 Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

# 🌐 Endpoint
BASE = "https://openlibrary.org/search.json"

# 📝 Query (given)
QUERY = 'trending_score_hourly_sum:[1 TO *] -subject:"content_warning:cover" language:eng -subject:"content_warning:cover" -subject:"content_warning:cover"'

# 🔢 Pagination
LIMIT = 100  # page size
TARGET = 500  # how many books we want
MAX_PAGES = 20  # safety cap (in case docs per page < LIMIT)

# 🎯 Sort & Fields
# We request specific fields to ensure ratings/trending/year/author are returned.
FIELDS = [
    "key",
    "title",
    "author_name",
    "author_key",
    "first_publish_year",
    "ratings_average",
    "ratings_count",
    "trending_score_hourly_sum"
]
FIELDS_PARAM = ",".join(FIELDS)

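# 🌐 Headers (set a User-Agent; for regular use, Open Library's API guidance
# asks for one that names your app and a contact email rather than a browser string)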
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0 Safari/537.36"
    )
}


## 1) Test request (page 1): sanity check
We call the API with our `q`, `sort=trending`, `page=1`, `limit=100`, and the `fields` list.


In [16]:
page = 1
params = {
    "q": QUERY,
    "sort": "trending",
    "page": page,
    "limit": LIMIT,
    "fields": FIELDS_PARAM
}
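print("Requesting page", page)
# A timeout keeps the cell from hanging if the server does not answer
r = requests.get(BASE, params=params, headers=headers, timeout=30)
print("Status code:", r.status_code)
r.raise_for_status()  # fail fast on a non-2xx response before parsing JSON

data = r.json()
type(data), list(data.keys())[:5], data.get("numFound", None)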

Requesting page 1
Status code: 200


(dict, ['numFound', 'start', 'numFoundExact', 'num_found', 'q'], 270929)

In [17]:
# Peek at the first 2 docs to see field availability
docs = data.get("docs", [])
print("Docs on page 1:", len(docs))
docs[:2]

Docs on page 1: 100


[{'author_key': ['OL18319A'],
  'author_name': ['Mark Twain'],
  'first_publish_year': 1889,
  'key': '/works/OL54031W',
  'title': "A Connecticut Yankee in King Arthur's Court",
  'ratings_average': 3.7142856,
  'ratings_count': 14,
  'trending_score_hourly_sum': 420},
 {'author_key': ['OL9388A'],
  'author_name': ['William Shakespeare'],
  'first_publish_year': 1611,
  'key': '/works/OL362699W',
  'title': 'The Tempest',
  'ratings_average': 3.9649122,
  'ratings_count': 57,
  'trending_score_hourly_sum': 493}]
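
Looking at two docs only hints at field availability. A minimal sketch of a fuller check counts, per requested field, how many docs lack it. The helper name `missing_field_counts` and the second sample doc are hypothetical and used for illustration only.

```python
def missing_field_counts(docs, fields):
    """Count, per field, how many docs lack it (absent or None)."""
    missing = {f: 0 for f in fields}
    for doc in docs:
        for f in fields:
            if doc.get(f) is None:
                missing[f] += 1
    return missing

sample_docs = [
    {"key": "/works/OL54031W",
     "title": "A Connecticut Yankee in King Arthur's Court",
     "ratings_average": 3.7142856,
     "ratings_count": 14},
    # Hypothetical doc without ratings, for illustration only
    {"key": "/works/OL0W", "title": "Untitled example"},
]

counts = missing_field_counts(
    sample_docs, ["key", "title", "ratings_average", "ratings_count"]
)
print(counts)
```

In the notebook, the same function could be called on `docs` and `FIELDS` to see which columns will contain gaps.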


## 2) Loop pages until 500 books
We collect docs across pages.

In [18]:
# Empty containers
ranks = []
titles = []
authors = []
author_urls = []
avg_ratings = []
num_ratings = []
years = []
trend_scores = []
book_urls = []

total_collected = 0
global_rank = 0

for page in range(1, MAX_PAGES + 1):
    if total_collected >= TARGET:
        break

    params = {
        "q": QUERY,
        "sort": "trending",
        "page": page,
        "limit": LIMIT,
        "fields": FIELDS_PARAM
    }

    print(f"\nFetching page {page} ...")
    r = requests.get(BASE, params=params, headers=headers, timeout=30)
    print("Status:", r.status_code)
    r.raise_for_status()  # stop on a non-2xx response instead of parsing an error body
    data = r.json()
    docs = data.get("docs", [])
    print("Docs returned:", len(docs))

    if not docs:
        print("No more results.")
        break

    # Assign a running rank and extract fields
    for d in docs:
        global_rank += 1

        title = d.get("title") or ""

        # Author (name): first listed author, if any
        author_list = d.get("author_name") or []
        author = author_list[0] if author_list else ""

        # Author (URL): built from the first author key, if any
        akeys = d.get("author_key") or []
        akey = akeys[0] if akeys else ""
        author_url = f"https://openlibrary.org/authors/{akey}" if akey else ""

        avg = d.get("ratings_average", None)
        cnt = d.get("ratings_count", None)

        year = d.get("first_publish_year", None)

        tscore = d.get("trending_score_hourly_sum", None)

        k = d.get("key", "")
        url = ""
        if isinstance(k, str):
            if k.startswith("/"):
                url = "https://openlibrary.org" + k
            else:
                url = "https://openlibrary.org/works/" + k

        ranks.append(global_rank)
        titles.append(title)
        authors.append(author)
        author_urls.append(author_url)
        avg_ratings.append(avg)
        num_ratings.append(cnt)
        years.append(year)
        trend_scores.append(tscore)
        book_urls.append(url)

        total_collected += 1
        if total_collected >= TARGET:
            break

    time.sleep(random.uniform(0.8, 1.6))

print("\nTotal collected:", total_collected)



Fetching page 1 ...
Status: 200
Docs returned: 100

Fetching page 2 ...
Status: 200
Docs returned: 100

Fetching page 3 ...
Status: 200
Docs returned: 100

Fetching page 4 ...
Status: 200
Docs returned: 100

Fetching page 5 ...
Status: 200
Docs returned: 100

Total collected: 500
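
The stop conditions in the loop above (target reached, empty page, page cap) can also be expressed with a generator. This is a sketch under assumptions: `fetch_page` is a hypothetical callable that maps a page number to a list of docs, and `fake_fetch` stands in for the network call so the example runs offline.

```python
from itertools import islice

def iter_docs(fetch_page, max_pages):
    """Yield docs page by page until a page is empty or max_pages is hit."""
    for page in range(1, max_pages + 1):
        docs = fetch_page(page)
        if not docs:
            return
        yield from docs

def collect(fetch_page, target, max_pages):
    """Return up to `target` (rank, doc) pairs, with ranks starting at 1."""
    docs = islice(iter_docs(fetch_page, max_pages), target)
    return list(enumerate(docs, start=1))

# Stand-in for the network call: 3 pages of 4 fake docs, then nothing
def fake_fetch(page):
    if page > 3:
        return []
    return [{"key": f"/works/P{page}N{i}"} for i in range(4)]

ranked = collect(fake_fetch, target=10, max_pages=20)
print(len(ranked), ranked[0], ranked[-1])
```

Because `islice` stops pulling from the generator once the target is reached, no page beyond the one that completes the target is requested. In real use, `fetch_page` would wrap the `requests.get` call and the sleep between pages.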



## 3) Build the final DataFrame & clean types


In [19]:
df = pd.DataFrame({
    "rank": ranks,  # running position across all pages, not a per-page index
    "title": titles,
    "author": authors,
    "author_url": author_urls,
    "avg_rating": avg_ratings,
    "num_ratings": num_ratings,
    "year": years,
    "score": trend_scores,
    "book_url": book_urls
})

df["avg_rating"] = pd.to_numeric(df["avg_rating"], errors="coerce").round(2)
# downcast="integer" only yields an integer dtype when the column has no missing values
df["num_ratings"] = pd.to_numeric(df["num_ratings"], errors="coerce", downcast="integer")
df["year"] = pd.to_numeric(df["year"], errors="coerce", downcast="integer")
df["score"] = pd.to_numeric(df["score"], errors="coerce")

df.head(12)


|  | rank | title | author | author_url | avg_rating | num_ratings | year | score | book_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Connecticut Yankee in King Arthur's Court | Mark Twain | https://openlibrary.org/authors/OL18319A | 3.71 | 14.0 | 1889 | 420 | https://openlibrary.org/works/OL54031W |
| 1 | 2 | The Tempest | William Shakespeare | https://openlibrary.org/authors/OL9388A | 3.96 | 57.0 | 1611 | 493 | https://openlibrary.org/works/OL362699W |
| 2 | 3 | Bleak House | Charles Dickens | https://openlibrary.org/authors/OL24638A | 3.93 | 14.0 | 1850 | 148 | https://openlibrary.org/works/OL14868510W |
| 3 | 4 | Logische Untersuchungen | Edmund Husserl | https://openlibrary.org/authors/OL132405A | 5.00 | 1.0 | 1900 | 313 | https://openlibrary.org/works/OL1304069W |
| 4 | 5 | The Magic Finger | Roald Dahl | https://openlibrary.org/authors/OL34184A | 4.00 | 29.0 | 1966 | 82 | https://openlibrary.org/works/OL45876W |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7 | 8 | Преступление и наказание | Фёдор Михайлович Достоевский | https://openlibrary.org/authors/OL22242A | 4.21 | 102.0 | 1866 | 163 | https://openlibrary.org/works/OL166894W |
| 8 | 9 | The Tao of sexology | Stephen T. Chang | https://openlibrary.org/authors/OL950704A | 5.00 | 1.0 | 1986 | 72 | https://openlibrary.org/works/OL4631952W |
| 9 | 10 | Evening Class | Maeve Binchy | https://openlibrary.org/authors/OL21305A | 3.50 | 2.0 | 1994 | 80 | https://openlibrary.org/works/OL56771W |
| 10 | 11 | Midnight's Children | Salman Rushdie | https://openlibrary.org/authors/OL26769A | 3.89 | 37.0 | 1981 | 43 | https://openlibrary.org/works/OL457179W |



## 4) Save to CSV


In [20]:

# Forward slashes avoid backslash-escape pitfalls in the string and also work on Windows
out_csv = "../data/openlibrary_trending_500.csv"
df.to_csv(out_csv, index=False, encoding="utf-8")
print("Saved:", out_csv, "- rows:", len(df))


Saved: ../data/openlibrary_trending_500.csv - rows: 500
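
A quick way to confirm the export is to read the file back and check the row count and that non-ASCII titles survived. The sketch below uses only the standard library and a throwaway two-row file. The helper name `read_csv_rows` and the sample rows are assumptions for illustration, not the real export.

```python
import csv
import os
import tempfile

def read_csv_rows(path):
    """Read a UTF-8 CSV with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Hypothetical two-row sample, written to a temporary directory
sample = [
    {"rank": "1", "title": "The Tempest"},
    {"rank": "2", "title": "Преступление и наказание"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["rank", "title"])
        writer.writeheader()
        writer.writerows(sample)
    rows = read_csv_rows(path)

print(len(rows), rows[1]["title"])
```

Pointing `read_csv_rows` at the saved file and comparing `len(rows)` with `len(df)` would give the same check for the real data.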



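### ✅ Notes
- We used `page` + `limit` for pagination. `page` starts at 1. We used `limit=100` for fewer requests.
- We passed `fields=...` to **explicitly** request the ratings, year, and trending fields. Missing numeric fields end up as `NaN` in the DataFrame; a missing title or author becomes an empty string.
- `rank` is our running index across pages, i.e. the position in the order the API returns for `sort=trending`. In the sample output, `trending_score_hourly_sum` does not decrease monotonically down that order (420, 493, 148, ...), so the sort evidently uses something other than the raw hourly sum.
- To collect more or fewer books, change `TARGET`. `LIMIT` sets the page size, and `MAX_PAGES` must be large enough to cover `TARGET / LIMIT` pages.
- To restrict results to English, we filter with `language:eng` in the `QUERY` string. The sample output still includes works with German and Russian titles, so the filter does not guarantee English titles, presumably because it matches any work that has an English edition.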
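
The relationship between the pagination settings is simple arithmetic. A minimal sketch, using the values from this notebook (the helper name `pages_needed` is an assumption):

```python
import math

def pages_needed(target, limit):
    """Number of full-size pages required to reach `target` docs."""
    return math.ceil(target / limit)

TARGET, LIMIT, MAX_PAGES = 500, 100, 20

n_pages = pages_needed(TARGET, LIMIT)
print(n_pages, n_pages <= MAX_PAGES)  # 5 True
```

If a page returns fewer docs than `LIMIT`, more pages are needed than this estimate, which is what the `MAX_PAGES` cap allows for.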

### ℹ️ Field mapping to the table requirements
- **Rank** → `rank` (our running counter)
- **Book Title** → `title`
- **Author** → first value of `author_name`
- **Author URL** → built from the first value of `author_key`
- **Average rating** → `ratings_average`
- **Number of ratings** → `ratings_count`
- **Year** → `first_publish_year`
- **Score** → `trending_score_hourly_sum` (activity score; 24h aggregate)
- **Book URL** → built from `key` (work URL)
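
The mapping above can be sketched as one function that turns a Search API doc into a flat row. This is an alternative to the nine parallel lists used earlier, not the code that produced the results. The name `doc_to_row` is an assumption, and unlike the loop above it returns an empty URL when `key` is missing.

```python
OL_BASE = "https://openlibrary.org"

def doc_to_row(doc, rank):
    """Map one Search API doc to a flat row following the field mapping."""
    authors = doc.get("author_name") or []
    author_keys = doc.get("author_key") or []
    key = doc.get("key") or ""

    if key.startswith("/"):
        book_url = OL_BASE + key
    elif key:
        book_url = f"{OL_BASE}/works/{key}"
    else:
        book_url = ""

    return {
        "rank": rank,
        "title": doc.get("title") or "",
        "author": authors[0] if authors else "",
        "author_url": f"{OL_BASE}/authors/{author_keys[0]}" if author_keys else "",
        "avg_rating": doc.get("ratings_average"),
        "num_ratings": doc.get("ratings_count"),
        "year": doc.get("first_publish_year"),
        "score": doc.get("trending_score_hourly_sum"),
        "book_url": book_url,
    }

# Second doc from the page-1 sample output
tempest = {
    "author_key": ["OL9388A"],
    "author_name": ["William Shakespeare"],
    "first_publish_year": 1611,
    "key": "/works/OL362699W",
    "title": "The Tempest",
    "ratings_average": 3.9649122,
    "ratings_count": 57,
    "trending_score_hourly_sum": 493,
}
row = doc_to_row(tempest, rank=2)
print(row["author_url"], row["book_url"])
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`, which keeps each book's values together instead of relying on list positions lining up.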

### Next steps
- Join with the Goodreads CSV if you want a single dataset.
- Add more fields via `fields=` (e.g., `edition_count`, `readinglog_count`, `cover_i`).
