
# 📚 Goodreads: *Best Books Ever* (Scraping)
**Goal:** scrape the **Best Books Ever** list from Goodreads and build a CSV with the first **500** books (5 pages of 100 books each).

> Target page: https://www.goodreads.com/list/show/1.Best_Books_Ever

**Fields we aim to collect per book (if available on the page):**
- Rank (position on the list)
- Title
- Author
- Average rating
- Number of ratings
- Year (if shown)
- Score / Votes
- Book URL


In [None]:

# 📦 Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
import random

# 🔧 Display and warning settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
import warnings
warnings.filterwarnings('ignore')

# 🌐 Headers (polite: set a user-agent)
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0 Safari/537.36"
    )
}

BASE_URL = "https://www.goodreads.com"
LIST_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"



## 1) Fetch page 1 and inspect
We start with page 1 to confirm the structure, then we will generalize to multiple pages.


In [19]:

# Request page 1
page = 1
url = LIST_URL + f"?page={page}"
print("Requesting:", url)
response = requests.get(url, headers=headers, timeout=30)  # timeout so a stalled request cannot hang the notebook
print("Status code:", response.status_code)

# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# Try to locate the main table that contains the list
table = soup.find("table", {"class": "tableList"})
type(table), (str(table)[:300] + "...") if table else "table not found"


Requesting: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1
Status code: 200


(bs4.element.Tag,
 '<table class="tableList js-dataTooltip">\n<!-- Add query string params -->\n<tr itemscope="" itemtype="http://schema.org/Book">\n<td class="number" valign="top">1</td>\n<td valign="top" width="5%">\n<div class="u-anchorTarget" id="2767052"></div>\n<div class="js-tooltipTrigger tooltipTrigger" data-resourc...')
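The lookup above works even though the table carries two classes (`tableList js-dataTooltip`), because Beautiful Soup matches an element when any one of its classes equals the requested name. A minimal offline sketch of that behaviour follows. The HTML fragment, its URL path, and the helper name `find_list_table` are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the structure printed above (not real Goodreads HTML).
SAMPLE_HTML = """
<table class="tableList js-dataTooltip">
  <tr itemscope itemtype="http://schema.org/Book">
    <td class="number">1</td>
    <td><a class="bookTitle" href="/book/show/0-sample-book"><span>Sample Book</span></a></td>
  </tr>
</table>
"""


def find_list_table(html):
    """Return the list table, or raise if the expected markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches a single class name even on multi-class elements.
    table = soup.find("table", class_="tableList")
    if table is None:
        raise ValueError("list table not found; the page layout may have changed")
    return table


table = find_list_table(SAMPLE_HTML)
rows = table.select("tr")
rank_text = rows[0].select_one("td.number").get_text(strip=True)
title_text = rows[0].select_one("a.bookTitle").get_text(strip=True)
print(len(rows), rank_text, title_text)
```

Raising an explicit error when the table is missing is one possible design choice; the notebook itself falls back to an empty row list instead.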


## 2) Parse rows on a single page
Each book is typically a `tr` row inside the table with class `tableList`.
We'll collect columns with very **defensive parsing** (lots of `if` checks) so it doesn't break if some field is missing.
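The defensive idea described above can also be expressed as a small pure function that takes the rating text and returns `None` for anything it cannot find. This is only a sketch: `parse_minirating` is a hypothetical helper name, and the optional singular form ("1 rating") is an added assumption rather than something observed on the page.

```python
import re

# Same idea as the patterns used in this notebook; "ratings?" additionally
# tolerates a singular "1 rating" (an assumption, not observed in the output).
AVG_RE = re.compile(r"(\d\.\d+)\s*avg rating")
COUNT_RE = re.compile(r"(\d[\d,]*)\s*ratings?\b")


def parse_minirating(text):
    """Return (average_rating, rating_count); each is None when not found."""
    avg = None
    count = None
    avg_match = AVG_RE.search(text)
    if avg_match:
        avg = float(avg_match.group(1))
    count_match = COUNT_RE.search(text)
    if count_match:
        # Strip thousands separators before converting.
        count = int(count_match.group(1).replace(",", ""))
    return avg, count


# "\u2014" is the dash that separates the two parts in the page text.
sample = "4.40 avg rating \u2014 293,812 ratings"
print(parse_minirating(sample))
print(parse_minirating("no rating information"))
```

Because the function only sees text, it can be checked without any network access, and the row loop only needs to pass it `mini.get_text(" ", strip=True)`.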


In [20]:

# Temporary lists (single page demo first)
ranks = []
titles = []
authors = []
avg_ratings = []
num_ratings = []
years = []
scores = []
votes = []
book_urls = []

rows = table.select("tr") if table else []
print("Rows found on page", page, ":", len(rows))

# We'll use a running counter for rank within the page (the site usually shows it, but we'll keep a simple counter)
local_rank = 0

for row in rows:
    local_rank += 1
    
    # Title & URL
    a_title = row.select_one("a.bookTitle")  # sometimes text is inside a span
    title_text = ""
    rel_url = ""
    if a_title:
        # new Goodreads often wraps the visible text in a nested <span>
        # we fall back to the anchor text if no span
        span_title = a_title.select_one("span")
        if span_title and span_title.get_text(strip=True):
            title_text = span_title.get_text(strip=True)
        else:
            title_text = a_title.get_text(strip=True)
        rel_url = a_title.get("href", "")

    # Author
    a_author = row.select_one("a.authorName")
    author_text = ""
    if a_author:
        span_author = a_author.select_one("span")
        if span_author and span_author.get_text(strip=True):
            author_text = span_author.get_text(strip=True)
        else:
            author_text = a_author.get_text(strip=True)

    # Ratings: the text holds an average ("4.28 avg rating") and a count ("7,534,822 ratings")
    mini = row.select_one("span.minirating")
    avg_val = None
    num_val = None
    if mini:
        t = mini.get_text(" ", strip=True)
        # average rating
        m1 = re.search(r"([0-9]\.[0-9]+)\s*avg rating", t)
        if m1:
            try:
                avg_val = float(m1.group(1))
            except ValueError:
                avg_val = None
        # number of ratings
        m2 = re.search(r"([0-9][0-9,]*)\s*ratings", t)
        if m2:
            try:
                num_val = int(m2.group(1).replace(",", ""))
            except ValueError:
                num_val = None

    # Year (sometimes in small grey text "published 19xx" / "first published 19xx"); parsing is disabled here, so year_val stays None
    # small_year = row.select_one("span.greyText.smallText.uitext")
    year_val = None
    # if small_year:
    #     m3 = re.search(r"(19\d{2}|20\d{2})", small_year.get_text(" ", strip=True))
    #     if m3:
    #         try:
    #             year_val = int(m3.group(1))
    #         except:
    #             year_val = None

    # Score / Votes (sometimes appears in the same small text block with "score:" / "voters")
    score_val = None
    votes_val = None
    smalls = row.select("span.smallText.uitext")  # ← class names chained with dots
    for s in smalls:
        st = s.get_text(" ", strip=True).lower()

        # score: 276,401
        m4 = re.search(r"score:\s*([\d,]+)", st)
        if m4 and score_val is None:
            try:
                score_val = int(m4.group(1).replace(",", ""))
            except ValueError:
                score_val = None

        # 2,965 people voted
        m5 = re.search(r"([\d,]+)\s*people voted", st)
        if m5 and votes_val is None:
            try:
                votes_val = int(m5.group(1).replace(",", ""))
            except ValueError:
                votes_val = None

    # Absolute URL
    abs_url = ""
    if rel_url:
        if rel_url.startswith("http"):
            abs_url = rel_url
        else:
            abs_url = BASE_URL + rel_url

    # Append
    ranks.append(local_rank)
    titles.append(title_text)
    authors.append(author_text)
    avg_ratings.append(avg_val)
    num_ratings.append(num_val)
    years.append(year_val)
    scores.append(score_val)
    votes.append(votes_val)
    book_urls.append(abs_url)

# Build DataFrame for the single page
df_page1 = pd.DataFrame({
    "rank_in_page": ranks,
    "title": titles,
    "author": authors,
    "avg_rating": avg_ratings,
    "num_ratings": num_ratings,
    "year": years,
    "score": scores,
    "votes": votes,
    "book_url": book_urls
})

df_page1.head(10)


Rows found on page 1 : 100


index,rank_in_page,title,author,avg_rating,num_ratings,year,score,votes,book_url
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,4.35,9744424,,4280600,43518,https://www.goodreads.com/book/show/2767052-th...
1,2,Pride and Prejudice,Jane Austen,4.29,4720297,,2944964,30181,https://www.goodreads.com/book/show/1885.Pride...
2,3,To Kill a Mockingbird,Harper Lee,4.26,6782110,,2588126,26428,https://www.goodreads.com/book/show/2657.To_Ki...
3,4,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.5,3743795,,2069966,21060,https://www.goodreads.com/book/show/58613451-h...
4,5,The Book Thief,Markus Zusak,4.39,2839327,,1957878,20110,https://www.goodreads.com/book/show/19063.The_...
5,6,"Twilight (The Twilight Saga, #1)",Stephenie Meyer,3.67,7216059,,1753140,17869,https://www.goodreads.com/book/show/41865.Twil...
6,7,Animal Farm,George Orwell,4.01,4452034,,1699839,17591,https://www.goodreads.com/book/show/170448.Ani...
7,8,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,4.61,142891,,1641371,17004,https://www.goodreads.com/book/show/30.J_R_R_T...
8,9,The Chronicles of Narnia (The Chronicles of Na...,C.S. Lewis,4.28,701346,,1526109,15895,https://www.goodreads.com/book/show/11127.The_...
9,10,The Fault in Our Stars,John Green,4.12,5648104,,1410235,14593,https://www.goodreads.com/book/show/11870085-t...


In [21]:
print(smalls)

[<span class="greyText smallText uitext">
<span class="minirating"><span class="stars staticStars notranslate"><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p3" size="12x12"></span></span> 4.40 avg rating — 293,812 ratings</span>
</span>, <span class="smallText uitext">
<a href="#" onclick="Lightbox.showBoxByID('score_explanation', 300); return false;">score: 283,194</a>,
              <span class="greyText">and</span>
<a href="#" id="loading_link_220283" onclick="new Ajax.Request('/list/list_book/4165501', {asynchronous:true, evalScripts:true, onFailure:function(request){Element.hide('loading_anim_220283');$('loading_link_220283').innerHTML = '&lt;span class=&quot;error&quot;&gt;ERROR&lt;/span&gt;try again';$('loading_link_220283').show();;Element.hide('loading_anim_220283');}, onLoading:function(request){;Eleme
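The printed `smalls` shows that the selector `span.smallText.uitext` matches two blocks per row: the rating text first, then the score and vote text. A minimal sketch of the "first match wins" scan over several text blocks is below. `parse_score_votes` is a hypothetical helper, and the sample strings are illustrative values rather than one real row.

```python
import re

SCORE_RE = re.compile(r"score:\s*(\d[\d,]*)")
VOTES_RE = re.compile(r"(\d[\d,]*)\s*people voted")


def _to_int(digits):
    """Convert '276,401' style text to an int."""
    return int(digits.replace(",", ""))


def parse_score_votes(texts):
    """Scan text blocks in order and keep the first score and vote count found."""
    score = None
    votes = None
    for text in texts:
        lowered = text.lower()
        if score is None:
            score_match = SCORE_RE.search(lowered)
            if score_match:
                score = _to_int(score_match.group(1))
        if votes is None:
            votes_match = VOTES_RE.search(lowered)
            if votes_match:
                votes = _to_int(votes_match.group(1))
    return score, votes


# Illustrative blocks: the first has no score or votes, the second has both.
blocks = [
    "4.40 avg rating \u2014 293,812 ratings",
    "score: 276,401 , and 2,965 people voted",
]
print(parse_score_votes(blocks))
```

The first block is skipped without error because neither pattern matches it, which is the same behaviour the `is None` guards give in the row loop.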


## 3) Loop over the first 5 pages (500 books)
Now we repeat the same logic for pages 1 to 5 and combine everything into a single DataFrame.
We add a short random `sleep` between requests to be polite.
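The paging and polite delay described above can be separated from the parsing so the loop is easy to test without touching the network. The sketch below assumes two hypothetical helpers, `page_urls` and `crawl`; the `fetch` and `sleep` parameters are injected so that a fake fetcher can stand in for `requests.get`.

```python
import random
import time

LIST_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"


def page_urls(list_url, first_page=1, last_page=5):
    """Build the ?page=N URLs for an inclusive page range."""
    return [f"{list_url}?page={page}" for page in range(first_page, last_page + 1)]


def crawl(urls, fetch, min_delay=1.0, max_delay=2.5, sleep=time.sleep):
    """Fetch each URL in order, pausing a random interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


# Offline demo: a fake fetcher and a recorder instead of a real sleep.
recorded_delays = []
pages = crawl(
    page_urls(LIST_URL),
    fetch=lambda url: f"<html>{url}</html>",
    sleep=recorded_delays.append,
)
print(len(pages), len(recorded_delays))
```

In real use, `fetch` could be something like `lambda url: requests.get(url, headers=headers, timeout=30).text`, with the default `time.sleep` left in place.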


In [22]:

# Global containers (all pages)
all_ranks = []
all_titles = []
all_authors = []
all_avg_ratings = []
all_num_ratings = []
all_years = []
all_scores = []
all_votes = []
all_book_urls = []

# We'll compute a global rank across pages as we go
global_rank = 0

for page in range(1, 6):  # pages 1..5  (100 books per page => 500 books)
    url = LIST_URL + f"?page={page}"
    print(f"\nFetching page {page}: {url}")
    response = requests.get(url, headers=headers, timeout=30)  # timeout so a stalled request cannot hang the loop
    print("Status:", response.status_code)
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", {"class": "tableList"})
    rows = table.select("tr") if table else []
    print("Rows:", len(rows))

    for row in rows:
        global_rank += 1
    
        # Title & URL
        a_title = row.select_one("a.bookTitle")
        title_text = ""
        rel_url = ""
        if a_title:
            span_title = a_title.select_one("span")
            if span_title and span_title.get_text(strip=True):
                title_text = span_title.get_text(strip=True)
            else:
                title_text = a_title.get_text(strip=True)
            rel_url = a_title.get("href", "")
    
        # Author
        a_author = row.select_one("a.authorName")
        author_text = ""
        if a_author:
            span_author = a_author.select_one("span")
            if span_author and span_author.get_text(strip=True):
                author_text = span_author.get_text(strip=True)
            else:
                author_text = a_author.get_text(strip=True)
    
        # Ratings
        mini = row.select_one("span.minirating")
        avg_val = None
        num_val = None
        if mini:
            t = mini.get_text(" ", strip=True)
            m1 = re.search(r"([0-9]\.[0-9]+)\s*avg rating", t)
            if m1:
                try:
                    avg_val = float(m1.group(1))
                except ValueError:
                    avg_val = None
            m2 = re.search(r"([0-9][0-9,]*)\s*ratings", t)
            if m2:
                try:
                    num_val = int(m2.group(1).replace(",", ""))
                except ValueError:
                    num_val = None
    
        # Year (parsing disabled, so year_val stays None)
        # small_year = row.select_one("span.greyText.smallText.uitext")
        year_val = None
        # if small_year:
        #     m3 = re.search(r"(19\d{2}|20\d{2})", small_year.get_text(" ", strip=True))
        #     if m3:
        #         try:
        #             year_val = int(m3.group(1))
        #         except:
        #             year_val = None
    
        # Score / Votes
        score_val = None
        votes_val = None
        smalls = row.select("span.smallText.uitext")  # ← class names chained with dots
        for s in smalls:
            st = s.get_text(" ", strip=True).lower()
            m4 = re.search(r"score:\s*([\d,]+)", st)
            if m4 and score_val is None:
                try:
                    score_val = int(m4.group(1).replace(",", ""))
                except ValueError:
                    score_val = None
            m5 = re.search(r"([\d,]+)\s*people voted", st)
            if m5 and votes_val is None:
                try:
                    votes_val = int(m5.group(1).replace(",", ""))
                except ValueError:
                    votes_val = None
            
        # Absolute URL
        abs_url = ""
        if rel_url:
            if rel_url.startswith("http"):
                abs_url = rel_url
            else:
                abs_url = BASE_URL + rel_url
    
        # Append
        all_ranks.append(global_rank)
        all_titles.append(title_text)
        all_authors.append(author_text)
        all_avg_ratings.append(avg_val)
        all_num_ratings.append(num_val)
        all_years.append(year_val)
        all_scores.append(score_val)
        all_votes.append(votes_val)
        all_book_urls.append(abs_url)

    # Polite sleep between pages
    time.sleep(random.uniform(1.0, 2.5))

len(all_ranks), len(all_titles)



Fetching page 1: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1
Status: 200
Rows: 100

Fetching page 2: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2
Status: 200
Rows: 100

Fetching page 3: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=3
Status: 200
Rows: 100

Fetching page 4: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=4
Status: 200
Rows: 100

Fetching page 5: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=5
Status: 200
Rows: 100


(500, 500)


## 4) Build the final DataFrame and save to CSV
We keep numeric columns as proper numbers and export everything.
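One caveat when keeping numeric columns as proper numbers: a column that contains missing values generally stays floating-point after `pd.to_numeric`, even with `downcast="integer"`. If true integers are wanted, pandas' nullable `Int64` dtype is one option. The sketch below is an illustration on a tiny made-up frame, not the notebook's own cleaning step.

```python
import pandas as pd

# Tiny made-up frame: one missing score and an entirely empty year column.
raw = pd.DataFrame({
    "score": [4280600, None, 2588126],
    "year": [None, None, None],
})

typed = raw.copy()
for column in ["score", "year"]:
    # Coerce to numbers first, then switch to the nullable integer dtype,
    # which can hold missing values alongside whole numbers.
    typed[column] = pd.to_numeric(typed[column], errors="coerce").astype("Int64")

print(typed.dtypes)
print(typed)
```

With this dtype, missing entries show up as `<NA>` and the remaining values print without a trailing `.0`.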


In [23]:

df = pd.DataFrame({
    "rank": all_ranks,
    "title": all_titles,
    "author": all_authors,
    "avg_rating": all_avg_ratings,
    "num_ratings": all_num_ratings,
    "year": all_years,
    "score": all_scores,
    "votes": all_votes,
    "book_url": all_book_urls
})

# Basic cleaning / types
df["avg_rating"] = pd.to_numeric(df["avg_rating"], errors="coerce")
df["num_ratings"] = pd.to_numeric(df["num_ratings"], errors="coerce", downcast="integer")
df["year"] = pd.to_numeric(df["year"], errors="coerce", downcast="integer")
df["score"] = pd.to_numeric(df["score"], errors="coerce", downcast="integer")
df["votes"] = pd.to_numeric(df["votes"], errors="coerce", downcast="integer")

# Preview
df.head(15)


index,rank,title,author,avg_rating,num_ratings,year,score,votes,book_url
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,4.35,9744424,,4280600,43518,https://www.goodreads.com/book/show/2767052-th...
1,2,Pride and Prejudice,Jane Austen,4.29,4720297,,2944964,30181,https://www.goodreads.com/book/show/1885.Pride...
2,3,To Kill a Mockingbird,Harper Lee,4.26,6782110,,2588126,26428,https://www.goodreads.com/book/show/2657.To_Ki...
3,4,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.50,3743795,,2069966,21060,https://www.goodreads.com/book/show/58613451-h...
4,5,The Book Thief,Markus Zusak,4.39,2839327,,1957878,20110,https://www.goodreads.com/book/show/19063.The_...
...,...,...,...,...,...,...,...,...,...
10,11,The Picture of Dorian Gray,Oscar Wilde,4.13,1832030,,1356705,14147,https://www.goodreads.com/book/show/5297.The_P...
11,12,The Lightning Thief (Percy Jackson and the Oly...,Rick Riordan,4.31,3383752,,1245253,13004,https://www.goodreads.com/book/show/28187.The_...
12,13,Wuthering Heights,Emily Brontë,3.90,1998233,,1241043,12942,https://www.goodreads.com/book/show/6185.Wuthe...
13,14,The Giving Tree,Shel Silverstein,4.38,1222830,,1237330,12805,https://www.goodreads.com/book/show/370493.The...


In [24]:

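# Save to CSV (pathlib keeps the path portable; "\d" and "\g" are invalid escapes in a plain string)
from pathlib import Path
out_path = Path("..") / "data" / "goodreads_best_books_500.csv"
df.to_csv(out_path, index=False, encoding="utf-8")
print("Saved:", out_path, " - rows:", len(df))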


Saved: ..\data\goodreads_best_books_500.csv  - rows: 500



## ✅ Next steps
- Validate that you have ~500 rows (5 pages × ~100 books/page).
- You can add more pages if you need more books.
- Later, enrich with details from each **book page** (genres, description, etc.) if needed for the recommendation project.
- Combine with API-sourced books (another 500) to reach the project target.
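The validation step in the list above could look roughly like the following. This is a sketch under assumptions: `validate_books` is a hypothetical helper, the checks (row count, duplicate URLs, empty titles) are one reasonable selection, and the sample frame uses placeholder data with the notebook's column names.

```python
import pandas as pd


def validate_books(df, expected_rows=500):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    duplicates = int(df["book_url"].duplicated().sum())
    if duplicates:
        problems.append(f"{duplicates} duplicate book_url value(s)")
    missing_titles = int((df["title"].fillna("") == "").sum())
    if missing_titles:
        problems.append(f"{missing_titles} row(s) without a title")
    return problems


# Placeholder data: one repeated URL and one empty title.
sample = pd.DataFrame({
    "title": ["Book A", "", "Book C"],
    "book_url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
})

for problem in validate_books(sample, expected_rows=3):
    print("-", problem)
```

On the real data the call would simply be `validate_books(df)`, and the same function can be reused after merging in the API-sourced books by passing a different `expected_rows`.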
