
# 📚 Goodreads: *Best Books Ever* (Scraping)
**Goal:** scrape the list **Best Books Ever** from Goodreads and build a CSV with the first **600** books (6 pages of 100 books each).

> Target page: https://www.goodreads.com/list/show/1.Best_Books_Ever

**Fields we aim to collect per book (if available on the page):**
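- Rank (position on the list)
- Title
- Author and author URL
- Average rating
- Number of ratings
- Genres (from the book's detail page)
- Year of first publication (from the book's detail page, if shown)
- Score / Votes
- Book URL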
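
The field list above can be written down as a small record type. This is only a sketch: `BookRecord` is a hypothetical container that the notebook itself does not use (it works with parallel lists), and the field names are borrowed from the columns of the final DataFrame.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class BookRecord:
    """One row of the scraped list (hypothetical helper, not part of the notebook)."""
    rank: int
    title: str = ""
    author: str = ""
    author_url: str = ""
    avg_rating: Optional[float] = None
    num_ratings: Optional[int] = None
    genres: List[str] = field(default_factory=list)
    year: Optional[int] = None
    score: Optional[int] = None
    votes: Optional[int] = None
    book_url: str = ""


# Only rank and title are filled in; every other field keeps its default.
record = BookRecord(rank=1, title="The Hunger Games (The Hunger Games, #1)")
row = asdict(record)  # plain dict, one per book
```

A list of such dicts could be passed to `pd.DataFrame(...)` directly, which would avoid keeping eleven lists in sync by hand.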


In [1]:

# 📦 Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
import random

# 🔧 Display and warning settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

import warnings
warnings.filterwarnings('ignore')

# 🌐 Headers (polite: set a user-agent)
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0 Safari/537.36"
    )
}

BASE_URL = "https://www.goodreads.com"
LIST_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"



## 1) Fetch page 1 and inspect
We start with page 1 to confirm the structure, then we will generalize to multiple pages.
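
The per-page URL is the list URL plus a `page` query parameter. As a minimal sketch, the URL construction could be wrapped in a helper; `build_page_url` is a hypothetical name, and using `urlencode` instead of an f-string is a choice made here, not something the notebook does.

```python
from urllib.parse import urlencode

LIST_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever"


def build_page_url(page: int) -> str:
    """Return the list URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    return f"{LIST_URL}?{urlencode({'page': page})}"


# URLs for pages 1..6, i.e. the 600 books this notebook targets.
urls = [build_page_url(p) for p in range(1, 7)]
```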


In [2]:

# Request page 1
page = 1
url = LIST_URL + f"?page={page}"
print("Requesting:", url)
response = requests.get(url, headers=headers)
print("Status code:", response.status_code)

# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")

# Try to locate the main table that contains the list
table = soup.find("table", {"class": "tableList"})
type(table), (str(table)[:300] + "...") if table else "table not found"


Requesting: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1
Status code: 200


(bs4.element.Tag,
 '<table class="tableList js-dataTooltip">\n<!-- Add query string params -->\n<tr itemscope="" itemtype="http://schema.org/Book">\n<td class="number" valign="top">1</td>\n<td valign="top" width="5%">\n<div class="u-anchorTarget" id="2767052"></div>\n<div class="js-tooltipTrigger tooltipTrigger" data-resourc...')


## 2) Parse rows on a single page
Each book is typically a `tr` row inside the table with class `tableList`.
We'll collect columns with very **defensive parsing** (lots of `if` checks) so it doesn't break if some field is missing.
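
To illustrate the defensive style in isolation, here is a sketch of the rating extraction as a standalone function. It reuses the two regular expressions from the cell below; the function name `parse_minirating` and the sample strings are assumptions made for the example.

```python
import re
from typing import Optional, Tuple


def parse_minirating(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Extract (average rating, number of ratings); missing parts come back as None."""
    avg_val = None
    num_val = None
    m_avg = re.search(r"([0-9]\.[0-9]+)\s*avg rating", text)
    if m_avg:
        avg_val = float(m_avg.group(1))
    m_num = re.search(r"([0-9][0-9,]*)\s*ratings", text)
    if m_num:
        # Drop thousands separators before converting to int.
        num_val = int(m_num.group(1).replace(",", ""))
    return avg_val, num_val


full = parse_minirating("4.28 avg rating - 7,534,822 ratings")
missing = parse_minirating("no rating text here")
```

Because each field is matched separately, a row that lacks one of them still yields the other instead of raising.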


In [3]:
# Temporary lists (single page demo first)
ranks = []
titles = []
authors = []
author_url = [] 
avg_ratings = []
num_ratings = []
scores = []
votes = []
book_urls = []
book_genres = []
book_published_year = []

rows = table.select("tr") if table else []
print("Rows found on page", page, ":", len(rows))

# We'll use a running counter for rank within the page (the site usually shows it, but we'll keep a simple counter)
local_rank = 0

for row in rows:
    local_rank += 1
    
    # Title & URL
    a_title = row.select_one("a.bookTitle")  # sometimes text is inside a span
    title_text = ""
    rel_url = ""
    if a_title:
        # Goodreads wraps the visible text in a nested <span>
        # we fall back to the anchor text if no span
        span_title = a_title.select_one("span")
        if span_title and span_title.get_text(strip=True):
            title_text = span_title.get_text(strip=True)
        else:
            title_text = a_title.get_text(strip=True)
        rel_url = a_title.get("href", "")

    # Author
    a_author = row.select_one("a.authorName")
    author_text = ""
    author_href = ""
    if a_author:
        span_author = a_author.select_one("span")
        if span_author and span_author.get_text(strip=True):
            author_text = span_author.get_text(strip=True)
        else:
            author_text = a_author.get_text(strip=True)
            
        author_href = a_author.get("href", "")
        if author_href:
            # If it is a relative path, convert it to an absolute URL
            if not author_href.startswith("http"):
                author_href = BASE_URL + author_href

    # Ratings (e.g., "4.28 avg rating" and "7,534,822 ratings" in the same text span)
    mini = row.select_one("span.minirating")
    avg_val = None
    num_val = None
    if mini:
        t = mini.get_text(" ", strip=True)
        # average rating
        m1 = re.search(r"([0-9]\.[0-9]+)\s*avg rating", t)
        if m1:
            try:
                avg_val = float(m1.group(1))
            except:
                avg_val = None
        # number of ratings
        m2 = re.search(r"([0-9][0-9,]*)\s*ratings", t)
        if m2:
            try:
                num_val = int(m2.group(1).replace(",", ""))
            except:
                num_val = None

    # Score / Votes (sometimes appears in the same small text block with "score:" / "voters")
    score_val = None
    votes_val = None
    smalls = row.select("span.smallText.uitext")  # compound selector: both classes, joined with dots
    for s in smalls:
        st = s.get_text(" ", strip=True).lower()

        # score: 276,401
        m4 = re.search(r"score:\s*([\d,]+)", st)
        if m4 and score_val is None:
            try:
                score_val = int(m4.group(1).replace(",", ""))
            except:
                score_val = None

        # 2,965 people voted
        m5 = re.search(r"([\d,]+)\s*people voted", st)
        if m5 and votes_val is None:
            try:
                votes_val = int(m5.group(1).replace(",", ""))
            except:
                votes_val = None

    # Absolute URL
    abs_url = ""
    if rel_url:
        if rel_url.startswith("http"):
            abs_url = rel_url
        else:
            abs_url = BASE_URL + rel_url
    
    genres = []
    published_year = None
    if abs_url:
        try:
            # Request the book detail page
            detail_res = requests.get(abs_url, headers=headers)
            detail_soup = BeautifulSoup(detail_res.text, "html.parser")

            # 1) GENRES (new Goodreads layout)
            # Look for genre buttons, e.g.
            # <span class="BookPageMetadataSection__genreButton">
            #   <a ...><span class="Button__labelItem">Young Adult</span></a>
            # </span>
            genre_spans = detail_soup.select("span.BookPageMetadataSection__genreButton span.Button__labelItem")

            for g in genre_spans:
                g_text = g.get_text(strip=True)
                if g_text and g_text not in genres:
                    genres.append(g_text)

            # Fallback (older layout or if nothing found): links to /genres/...
            if not genres:
                fallback_genres = detail_soup.select("a[href*='/genres/']")
                for g in fallback_genres:
                    g_text = g.get_text(strip=True)
                    if g_text and g_text not in genres:
                        genres.append(g_text)
                        
            # 2) PUBLISHED YEAR
            # Target:
            # <p data-testid="publicationInfo">First published September 14, 2008</p>
            pub_info = detail_soup.find("p", {"data-testid": "publicationInfo"})
            year_text = ""

            if pub_info:
                year_text = pub_info.get_text(" ", strip=True)
            else:
                # Fallback: search any text containing "First published"
                possible = detail_soup.find(string=lambda t: t and "First published" in t)
                if possible:
                    year_text = possible.strip()

            if year_text:
                # Find all 4-digit years in the text (1800-2099)
                years_found = re.findall(r"(18[0-9]{2}|19[0-9]{2}|20[0-9]{2})", year_text)

                if years_found:
                    try:
                        # If multiple years appear, take the last one (usually the relevant one)
                        published_year = int(years_found[-1])
                    except:
                        published_year = None

            # Polite pause to not overload server
            time.sleep(0.3)

        except Exception:
            # On any request or parsing error, keep the defaults
            # (unlike a bare except, this still lets Ctrl+C interrupt the scrape)
            genres = []
            published_year = None
            
    # Append
    ranks.append(local_rank)
    titles.append(title_text)
    authors.append(author_text)
    author_url.append(author_href)
    avg_ratings.append(avg_val)
    num_ratings.append(num_val)
    book_genres.append(genres)
    book_published_year.append(published_year)
    scores.append(score_val)
    votes.append(votes_val)
    book_urls.append(abs_url)

# Build DataFrame for the single page
df_page1 = pd.DataFrame({
    "rank_in_page": ranks,
    "title": titles,
    "author": authors,
    "author_url": author_url,
    "avg_rating": avg_ratings,
    "num_ratings": num_ratings,
    "book_genres": book_genres,
    "book_published_year": book_published_year,
    "score": scores,
    "votes": votes,
    "book_url": book_urls
})

df_page1.head(10)

Rows found on page 1 : 100


| | rank_in_page | title | author | author_url | avg_rating | num_ratings | book_genres | book_published_year | score | votes | book_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | https://www.goodreads.com/author/show/153394.S... | 4.35 | 9748182 | [Young Adult, Dystopia, Fiction, Fantasy, Scie... | 2008.0 | 4283248 | 43546 | https://www.goodreads.com/book/show/2767052-th... |
| 1 | 2 | Pride and Prejudice | Jane Austen | https://www.goodreads.com/author/show/1265.Jan... | 4.29 | 4722539 | [Classics, Romance, Fiction, Historical Fictio... | 1813.0 | 2945744 | 30189 | https://www.goodreads.com/book/show/1885.Pride... |
| 2 | 3 | To Kill a Mockingbird | Harper Lee | https://www.goodreads.com/author/show/1825.Har... | 4.26 | 6784603 | [Classics, Fiction, Historical Fiction, School... | 1960.0 | 2589212 | 26439 | https://www.goodreads.com/book/show/2657.To_Ki... |
| 3 | 4 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling | https://www.goodreads.com/author/show/1077326.... | 4.5 | 3745274 | [Fantasy, Young Adult, Fiction, Harry Potter, ... | 2003.0 | 2070361 | 21065 | https://www.goodreads.com/book/show/58613451-h... |
| 4 | 5 | The Book Thief | Markus Zusak | https://www.goodreads.com/author/show/11466.Ma... | 4.39 | 2840351 | [Historical Fiction, Fiction, Young Adult, Cla... | 2005.0 | 1958034 | 20112 | https://www.goodreads.com/book/show/19063.The_... |
| 5 | 6 | Twilight (The Twilight Saga, #1) | Stephenie Meyer | https://www.goodreads.com/author/show/941441.S... | 3.67 | 7218768 | [Fantasy, Young Adult, Romance, Fiction, Vampi... | 2005.0 | 1753717 | 17875 | https://www.goodreads.com/book/show/41865.Twil... |
| 6 | 7 | Animal Farm | George Orwell | https://www.goodreads.com/author/show/3706.Geo... | 4.01 | 4454594 | [Classics, Fiction, Dystopia, Fantasy, School,... | 1945.0 | 1700431 | 17597 | https://www.goodreads.com/book/show/170448.Ani... |
| 7 | 8 | J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an... | J.R.R. Tolkien | https://www.goodreads.com/author/show/656983.J... | 4.61 | 142923 | [Fantasy, Fiction, Classics, Adventure, Scienc... | 1954.0 | 1642146 | 17013 | https://www.goodreads.com/book/show/30.J_R_R_T... |
| 8 | 9 | The Chronicles of Narnia (The Chronicles of Na... | C.S. Lewis | https://www.goodreads.com/author/show/1069006.... | 4.28 | 701557 | [Fantasy, Classics, Fiction, Young Adult, Chil... | 1956.0 | 1526303 | 15897 | https://www.goodreads.com/book/show/11127.The_... |
| 9 | 10 | The Fault in Our Stars | John Green | https://www.goodreads.com/author/show/1406384.... | 4.12 | 5649979 | [Young Adult, Romance, Fiction, Contemporary, ... | 2012.0 | 1410832 | 14600 | https://www.goodreads.com/book/show/11870085-t... |
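
The publication-year step from the cell above can be checked on its own. The sketch below reuses the same regular expression; `extract_year` is a hypothetical helper name, and the sample strings (apart from the "First published September 14, 2008" form quoted in the code comments) are invented for illustration.

```python
import re
from typing import Optional

# Same pattern as in the notebook: 4-digit years from 1800 to 2099.
YEAR_PATTERN = re.compile(r"(18[0-9]{2}|19[0-9]{2}|20[0-9]{2})")


def extract_year(text: str) -> Optional[int]:
    """Return the last matching year in the text, or None if there is none."""
    years_found = YEAR_PATTERN.findall(text)
    return int(years_found[-1]) if years_found else None


modern = extract_year("First published September 14, 2008")
classic = extract_year("First published January 28, 1813")
ancient = extract_year("First published January 1, 700")  # outside the pattern's range
```

Note the limitation this makes visible: any book first published before 1800 ends up with no year, so missing values in the year column do not necessarily mean the page lacked the information.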



## 3) Loop over the first 6 pages (600 books)
Now we repeat the same logic for pages 1 to 6 and combine everything into a single DataFrame.
We add a short random `sleep` between page requests, plus a fixed 0.3 s pause after each book-detail request, to be polite.
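
The random pause can also be factored out of the loop body. The following is a sketch under stated assumptions: `paced` is a hypothetical generator, and the `sleep` and `rng` parameters exist only so the behaviour can be demonstrated without actually waiting. Unlike the loop below, it pauses between items only, not after the last one.

```python
import random
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(items: Iterable[T], min_s: float, max_s: float,
          sleep=time.sleep, rng=random.uniform) -> Iterator[T]:
    """Yield items, sleeping a random min_s..max_s seconds between consecutive items."""
    first = True
    for item in items:
        if not first:
            sleep(rng(min_s, max_s))
        first = False
        yield item


# Demo: record the requested delays instead of sleeping.
waits = []
pages = list(paced(range(1, 7), 1.0, 2.5, sleep=waits.append))
```

In the real loop this would read `for page in paced(range(1, 7), 1.0, 2.5):` with the default `time.sleep`.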


In [None]:
# Global containers (all pages)
all_ranks = []
all_titles = []
all_authors = []
all_authors_url = []
all_avg_ratings = []
all_num_ratings = []
all_scores = []
all_votes = []
all_book_urls = []
all_book_genres = []
all_book_published_year = []

# We'll compute a global rank across pages as we go
global_rank = 0

for page in range(1, 7):  # pages 1..6  (100 books per page => 600 books)
    url = LIST_URL + f"?page={page}"
    print(f"\nFetching page {page}: {url}")
    response = requests.get(url, headers=headers)
    print("Status:", response.status_code)
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", {"class": "tableList"})
    rows = table.select("tr") if table else []
    print("Rows:", len(rows))

    for row in rows:
        global_rank += 1
    
        # Title & URL
        a_title = row.select_one("a.bookTitle")
        title_text = ""
        rel_url = ""
        if a_title:
            span_title = a_title.select_one("span")
            if span_title and span_title.get_text(strip=True):
                title_text = span_title.get_text(strip=True)
            else:
                title_text = a_title.get_text(strip=True)
            rel_url = a_title.get("href", "")
    
        # Author
        a_author = row.select_one("a.authorName")
        author_text = ""
        author_href = ""
        if a_author:
            span_author = a_author.select_one("span")
            if span_author and span_author.get_text(strip=True):
                author_text = span_author.get_text(strip=True)
            else:
                author_text = a_author.get_text(strip=True)
                
            author_href = a_author.get("href", "")
            if author_href:
                # If it is a relative path, convert it to an absolute URL
                if not author_href.startswith("http"):
                    author_href = BASE_URL + author_href
    
        # Ratings
        mini = row.select_one("span.minirating")
        avg_val = None
        num_val = None
        if mini:
            t = mini.get_text(" ", strip=True)
            m1 = re.search(r"([0-9]\.[0-9]+)\s*avg rating", t)
            if m1:
                try:
                    avg_val = float(m1.group(1))
                except:
                    avg_val = None
            m2 = re.search(r"([0-9][0-9,]*)\s*ratings", t)
            if m2:
                try:
                    num_val = int(m2.group(1).replace(",", ""))
                except:
                    num_val = None
    
        # Score / Votes
        score_val = None
        votes_val = None
        smalls = row.select("span.smallText.uitext")  # compound selector: both classes, joined with dots
        for s in smalls:
            st = s.get_text(" ", strip=True).lower()
            m4 = re.search(r"score:\s*([\d,]+)", st)
            if m4 and score_val is None:
                try:
                    score_val = int(m4.group(1).replace(",", ""))
                except:
                    score_val = None
            m5 = re.search(r"([\d,]+)\s*people voted", st)
            if m5 and votes_val is None:
                try:
                    votes_val = int(m5.group(1).replace(",", ""))
                except:
                    votes_val = None
            
        # Absolute URL
        abs_url = ""
        if rel_url:
            if rel_url.startswith("http"):
                abs_url = rel_url
            else:
                abs_url = BASE_URL + rel_url
    
        book_genres = []
        book_published_year = None
        if abs_url:
            try:
                # Request the book detail page
                detail_res = requests.get(abs_url, headers=headers)
                detail_soup = BeautifulSoup(detail_res.text, "html.parser")

                # 1) GENRES (new Goodreads layout)
                # Look for genre buttons, e.g.
                # <span class="BookPageMetadataSection__genreButton">
                #   <a ...><span class="Button__labelItem">Young Adult</span></a>
                # </span>
                genre_spans = detail_soup.select(
                    "span.BookPageMetadataSection__genreButton span.Button__labelItem"
                )

                for g in genre_spans:
                    g_text = g.get_text(strip=True)
                    if g_text and g_text not in book_genres:
                        book_genres.append(g_text)

                # Fallback (older layout or if nothing found): links to /genres/...
                if not book_genres:
                    fallback_genres = detail_soup.select("a[href*='/genres/']")
                    for g in fallback_genres:
                        g_text = g.get_text(strip=True)
                        if g_text and g_text not in book_genres:
                            book_genres.append(g_text)
                            
                # 2) PUBLISHED YEAR
                # Target:
                # <p data-testid="publicationInfo">First published September 14, 2008</p>
                pub_info = detail_soup.find("p", {"data-testid": "publicationInfo"})
                year_text = ""

                if pub_info:
                    year_text = pub_info.get_text(" ", strip=True)
                else:
                    # Fallback: search any text containing "First published"
                    possible = detail_soup.find(string=lambda t: t and "First published" in t)
                    if possible:
                        year_text = possible.strip()

                if year_text:
                    # Find all 4-digit years in the text (1800-2099)
                    years_found = re.findall(r"(18[0-9]{2}|19[0-9]{2}|20[0-9]{2})", year_text)

                    if years_found:
                        try:
                            # If multiple years appear, take the last one (usually the relevant one)
                            book_published_year = int(years_found[-1])
                        except:
                            book_published_year = None

                # Polite pause to not overload server
                time.sleep(0.3)

            except Exception:
                # On any request or parsing error, keep the defaults
                # (unlike a bare except, this still lets Ctrl+C interrupt the scrape)
                book_genres = []
                book_published_year = None
    
        # Append
        all_ranks.append(global_rank)
        all_titles.append(title_text)
        all_authors.append(author_text)
        all_authors_url.append(author_href)
        all_avg_ratings.append(avg_val)
        all_num_ratings.append(num_val)
        all_book_genres.append(book_genres)
        all_book_published_year.append(book_published_year)
        all_scores.append(score_val)
        all_votes.append(votes_val)
        all_book_urls.append(abs_url)
        
    # Polite sleep between pages
    time.sleep(random.uniform(1.0, 2.5))

len(all_ranks), len(all_titles)


Fetching page 1: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1
Status: 200
Rows: 100

Fetching page 2: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2
Status: 200
Rows: 100

Fetching page 3: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=3
Status: 200
Rows: 100

Fetching page 4: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=4
Status: 200
Rows: 100

Fetching page 5: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=5
Status: 200
Rows: 100

Fetching page 6: https://www.goodreads.com/list/show/1.Best_Books_Ever?page=6
Status: 200
Rows: 100


(600, 600)
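
Both cells turn relative links into absolute ones with a `startswith("http")` check. A standard-library alternative is `urllib.parse.urljoin`, sketched below. `to_absolute` is a hypothetical helper, and the example paths are made up rather than taken from the scraped data.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.goodreads.com"


def to_absolute(href: str) -> str:
    """Resolve href against BASE_URL; an empty href stays empty."""
    return urljoin(BASE_URL, href) if href else ""


# Made-up paths, for illustration only.
relative = to_absolute("/book/show/123.Example_Book")
absolute = to_absolute("https://www.goodreads.com/author/show/456.Example_Author")
empty = to_absolute("")
```

`urljoin` leaves already-absolute URLs untouched, so the same call covers both branches of the original check.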


## 4) Build the final DataFrame and save to CSV
We keep numeric columns as proper numbers and export everything.
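
One detail worth knowing here: a numeric column that contains a missing value is stored as float, which is presumably why the year shows up as `2008.0` in the previews. As a sketch of one possible remedy (not what the notebook does), pandas' nullable `Int64` dtype keeps whole numbers alongside missing entries. The toy frame below is invented for the example.

```python
import pandas as pd

# Toy frame: one book has no parsed year, as can happen in the scrape.
demo = pd.DataFrame({"title": ["A", "B", "C"], "year": [2008, None, 1813]})

# With a missing value present, the column is stored as float64.
float_dtype = str(demo["year"].dtype)

# The nullable "Int64" dtype keeps integers and marks the gap as <NA>.
demo["year"] = pd.to_numeric(demo["year"], errors="coerce").astype("Int64")
```

Applied to the real data, the same line on `df["year"]` would make the CSV contain `2008` rather than `2008.0`.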


In [5]:
df = pd.DataFrame({
    "rank": all_ranks,
    "title": all_titles,
    "author": all_authors,
    "author_url": all_authors_url, 
    "avg_rating": all_avg_ratings,
    "num_ratings": all_num_ratings,
    "genres": all_book_genres,
    "year": all_book_published_year,
    "score": all_scores,
    "votes": all_votes,
    "book_url": all_book_urls
})

# Basic cleaning / types
df["avg_rating"] = pd.to_numeric(df["avg_rating"], errors="coerce")
df["num_ratings"] = pd.to_numeric(df["num_ratings"], errors="coerce", downcast="integer")
df["year"] = pd.to_numeric(df["year"], errors="coerce", downcast="integer")
df["score"] = pd.to_numeric(df["score"], errors="coerce", downcast="integer")
df["votes"] = pd.to_numeric(df["votes"], errors="coerce", downcast="integer")

# Preview
df.head(15)

| | rank | title | author | author_url | avg_rating | num_ratings | genres | year | score | votes | book_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Hunger Games (The Hunger Games, #1) | Suzanne Collins | https://www.goodreads.com/author/show/153394.S... | 4.35 | 9748182 | [Young Adult, Dystopia, Fiction, Fantasy, Scie... | 2008.0 | 4283248 | 43546 | https://www.goodreads.com/book/show/2767052-th... |
| 1 | 2 | Pride and Prejudice | Jane Austen | https://www.goodreads.com/author/show/1265.Jan... | 4.29 | 4722539 | [Classics, Romance, Fiction, Historical Fictio... | 1813.0 | 2945744 | 30189 | https://www.goodreads.com/book/show/1885.Pride... |
| 2 | 3 | To Kill a Mockingbird | Harper Lee | https://www.goodreads.com/author/show/1825.Har... | 4.26 | 6784603 | [Classics, Fiction, Historical Fiction, School... | 1960.0 | 2589212 | 26439 | https://www.goodreads.com/book/show/2657.To_Ki... |
| 3 | 4 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling | https://www.goodreads.com/author/show/1077326.... | 4.50 | 3745274 | [Fantasy, Young Adult, Fiction, Harry Potter, ... | 2003.0 | 2070361 | 21065 | https://www.goodreads.com/book/show/58613451-h... |
| 4 | 5 | The Book Thief | Markus Zusak | https://www.goodreads.com/author/show/11466.Ma... | 4.39 | 2840351 | [Historical Fiction, Fiction, Young Adult, Cla... | 2005.0 | 1958034 | 20112 | https://www.goodreads.com/book/show/19063.The_... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 11 | The Picture of Dorian Gray | Oscar Wilde | https://www.goodreads.com/author/show/3565.Osc... | 4.13 | 1833393 | [Classics, Fiction, Horror, Gothic, Fantasy, L... | 1890.0 | 1357402 | 14154 | https://www.goodreads.com/book/show/5297.The_P... |
| 11 | 12 | The Lightning Thief (Percy Jackson and the Oly... | Rick Riordan | https://www.goodreads.com/author/show/15872.Ri... | 4.31 | 3384984 | [Fantasy, Young Adult, Mythology, Fiction, Mid... | 2005.0 | 1245842 | 13010 | https://www.goodreads.com/book/show/28187.The_... |
| 12 | 13 | Wuthering Heights | Emily Brontë | https://www.goodreads.com/author/show/4191.Emi... | 3.90 | 1999389 | [Classics, Fiction, Romance, Gothic, Historica... | 1847.0 | 1241242 | 12944 | https://www.goodreads.com/book/show/6185.Wuthe... |
| 13 | 14 | The Giving Tree | Shel Silverstein | https://www.goodreads.com/author/show/435477.S... | 4.38 | 1223189 | [Childrens, Classics, Fiction, Picture Books, ... | 1964.0 | 1237518 | 12807 | https://www.goodreads.com/book/show/370493.The... |


In [6]:
# Save to CSV
out_path = "../data/goodreads_best_books_600.csv"
df.to_csv(out_path, index=False, encoding="utf-8")
print("Saved:", out_path, " - rows:", len(df))

Saved: ../data/goodreads_best_books_600.csv  - rows: 600
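
A caveat about the exported file: the genres column holds Python lists, and `to_csv` writes each list as its string representation, which has to be parsed again when the CSV is read back. A minimal sketch of one alternative is to store the genres as a single delimited string. The helper names and the `|` separator are assumptions, and the approach only works if no genre name contains that character.

```python
from typing import List

SEP = "|"  # assumed not to occur inside any genre name


def genres_to_cell(genres: List[str]) -> str:
    """Join a list of genres into one CSV-friendly string."""
    return SEP.join(genres)


def cell_to_genres(cell: str) -> List[str]:
    """Split the string back into a list; an empty cell means no genres."""
    return cell.split(SEP) if cell else []


cell = genres_to_cell(["Young Adult", "Dystopia", "Fiction"])
restored = cell_to_genres(cell)
```

In the notebook this could be applied with `df["genres"].apply(genres_to_cell)` before calling `to_csv`, and reversed with `cell_to_genres` after loading.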



## ✅ Next steps
- Validate that we have 600 rows (6 pages × 100 books/page).
- Add more pages if we need more books.
- Enrich with further details from each **book page** (description, etc.) if needed for the recommendation project; genres and publication year are already collected.
- Combine with API-sourced books (another 500) to reach the project target.
