# Overview
Goal: What makes a bestseller? I often see the words "#1 NYT bestseller" on book covers, but what does that really mean? I want to analyze the 21st-century books that are most popular each year on the widely used book-rating site Goodreads.

# Introduction & Goal
Objective: Analyze NYT fiction bestsellers to understand the factors contributing to their popularity, longevity, and reader engagement.

Research Questions:
* What common characteristics (genre, author, themes, Goodreads ratings) do bestselling novels share?
* Is there a clear correlation between Goodreads metrics (ratings, reviews, tags) and books achieving #1 status?
* Are there identifiable trends or shifts in reader preferences over recent decades?


In [None]:
!pip install -q pandas numpy matplotlib seaborn

!pip install -q kaggle

# Data Collection
Primary Sources:

## Kaggle Dataset (1931-2020)
Includes: Book Title, Author, Year, Weeks at #1, Category (Fiction/Non-fiction)

In [None]:
from google.colab import files
uploaded = files.upload()    # select your kaggle.json
print(uploaded.keys())       # shows exactly what file(s) arrived

Saving kaggle.json to kaggle.json
dict_keys(['kaggle.json'])


In [None]:
# 1. Make the hidden .kaggle folder
!mkdir -p ~/.kaggle

# 2. Move the uploaded kaggle.json into it
!mv kaggle.json ~/.kaggle/

# 3. Set strict permissions so only you can read it
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list -s new-york-times-bestsellers

ref                                                       title                                                  size  lastUpdated                 downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  -----------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
sujaykapadnis/new-york-times-bestsellers                  New York Times BestSellers                           619998  2023-09-24 06:26:11.353000            342         12  0.9411765        
konradb/making-sense-with-sam-harris-transcripts          Making sense with Sam Harris: transcripts           3063947  2021-09-02 17:10:18.113000             48          7  0.625            
bryantreese/nyt-bestsellers-1931-2024-fictionnon-fiction  NYT Bestsellers 1931-2024 (Fiction/non-Fiction)     5604008  2024-11-26 09:02:56.037000             48          0  0.5294118        


In [None]:
!kaggle datasets download sujaykapadnis/new-york-times-bestsellers -p /content --unzip

Dataset URL: https://www.kaggle.com/datasets/sujaykapadnis/new-york-times-bestsellers
License(s): other


In [None]:
import pandas as pd

nyt_full = pd.read_csv('/content/nyt_full.tsv', sep='\t')
nyt_titles = pd.read_csv('/content/nyt_titles.tsv', sep='\t')
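
Before going further, it can help to confirm that the loaded frames contain the columns the later cells use. The sketch below is one way to do that. The column names are taken from how `nyt_titles` and `nyt_full` are used in this notebook, while `EXPECTED`, `missing_columns`, and the toy frame are illustrative names and values, not part of the dataset.

```python
import pandas as pd

# Columns the later cells rely on (inferred from their use in this notebook)
EXPECTED = {
    "nyt_titles": {"id", "title", "author", "year", "best_rank"},
    "nyt_full": {"title_id", "week", "rank"},
}

def missing_columns(df, required):
    """Return the sorted list of required column names that df lacks."""
    return sorted(set(required) - set(df.columns))

# Toy frame standing in for nyt_titles (illustrative values only)
demo = pd.DataFrame({"id": [1], "title": ["THE GOOD EARTH"], "year": [1931]})
print(missing_columns(demo, EXPECTED["nyt_titles"]))  # ['author', 'best_rank']
```

On the real data, `missing_columns(nyt_titles, EXPECTED["nyt_titles"])` should return an empty list if the files loaded as expected.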

# Goodreads Scraper
Fields of interest for each book: slug (book ID and title), book ID, title, ISBN, ISBN13, year first published, author, number of pages, genres, top shelves, lists, total number of ratings, total number of reviews, average rating, and rating distribution.


## Creating the list of bestsellers

In [None]:
import re

# Keep only books that reached #1
best1 = (
    nyt_titles
    .loc[(nyt_titles['best_rank'] == 1), ['year','title']]
    .drop_duplicates()   # ensure each title/year appears only once
    .sort_values('year')
)

# Function to build "Cleaned_Title" slugs (the year is not part of the slug)
def make_slug(row):
    # strip punctuation
    title_clean = re.sub(r'[^\w\s]', '', row['title'])
    # collapse whitespace to single underscore
    title_clean = re.sub(r'\s+', '_', title_clean.strip())
    return title_clean

# Apply to each row
slugs = best1.apply(make_slug, axis=1)

# Write to text file
with open('my_book_ids.txt', 'w') as f:
    for slug in slugs:
        f.write(slug + '\n')

print(f"Wrote {len(slugs)} unique slugs to my_book_ids.txt")

Wrote 967 unique slugs to my_book_ids.txt


## Finding the HTML links for all books

In [None]:
import requests
from bs4 import BeautifulSoup
import urllib.parse

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def lookup_slug(title):
    """
    Search Goodreads for `title`, return the first slug (e.g. "1729347.Maid_In_Waiting")
    """
    q = urllib.parse.quote_plus(title)
    url = f"https://www.goodreads.com/search?q={q}"
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'lxml')

    # Find the first book link in the search results
    link = soup.select_one('a.bookTitle')
    if not link or not link.get('href'):
        return None

    # href looks like "/book/show/1729347.Maid_In_Waiting"
    slug = link['href'].split('/book/show/')[-1].split('?')[0]
    return slug  # e.g. "1729347.Maid_In_Waiting"

In [15]:
import concurrent.futures
import time
import urllib.parse
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def lookup_slug(title, timeout=10.0):
    """
    Search Goodreads for `title` and return the slug of the first hit, or None.
    The first hit is not guaranteed to be the intended book.
    """
    q   = urllib.parse.quote_plus(title)
    url = f"https://www.goodreads.com/search?q={q}"
    # timeout keeps a stalled request from blocking a worker thread forever
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'lxml')
    link = soup.select_one('a.bookTitle')
    if not link or not link.get('href'):
        return None
    slug = link['href'].split('/book/show/')[-1].split('?')[0]
    return slug

def parallel_lookup(titles, max_workers=20):
    slugs = []
    # named `executor` so the builtin `exec` is not shadowed
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # schedule all lookups
        future_to_title = {executor.submit(lookup_slug, t): t for t in titles}
        # as each finishes, collect its result
        for future in concurrent.futures.as_completed(future_to_title):
            title = future_to_title[future]
            try:
                slug = future.result()
                if slug:
                    slugs.append(slug)
                    print(f"✅ {title} → {slug}")
                else:
                    print(f"⚠️  no slug for {title}")
            except Exception as e:
                print(f"❌ error for {title}: {e}")
            # brief pause while collecting results; the requests are already
            # queued on the pool, so this does not rate-limit them
            time.sleep(0.1)
    return slugs

# Usage:
with open('my_book_ids.txt') as f:
    titles = [l.strip() for l in f if l.strip()]

# Run lookups in parallel (20 threads)
goodreads_slugs = parallel_lookup(titles, max_workers=20)

# Write them out
with open('goodreads_slugs.txt','w') as f:
    for slug in goodreads_slugs:
        f.write(slug + '\n')
print(f"Wrote {len(goodreads_slugs)} slugs.")

✅ MR_AND_MRS_PENNINGTON → 22847158-mr-and-mrs-pennington
✅ A_NEW_YORK_TEMPEST → 34836658-a-new-york-tempest
✅ ANN_VICKERS → 703003.Ann_Vickers
✅ THE_END_OF_DESIRE → 1705051.The_End_of_Desire
✅ MARYS_NECK → 13550632-mary-s-neck
✅ THE_SHELTERED_LIFE → 29527692-the-sheltered-life-of-betsy-parker
✅ LARK_ASCENDING → 59892263-lark-ascending
✅ DISTRICT_NURSE → 36676577-the-district-nurses-of-victory-walk
✅ BRIGHT_SKIN → 7199666-bright-eyes-brown-skin
✅ MAID_IN_WAITING → 1729347.Maid_In_Waiting
✅ FLOWERING_WILDERNESS → 1380929.Flowering_Wilderness
✅ SONS → 9520360-the-son-of-neptune
✅ A_MODERN_HERO → 931581.The_Boz
✅ MAGNOLIA_STREET → 706433.The_Ballroom_on_Magnolia_Street
✅ THE_STORE → 17660462-the-everything-store
✅ THE_GOOD_EARTH → 1078.The_Good_Earth
✅ THE_HARBOURMASTER → 208048435-the-harbourmaster
✅ FARAWAY → 17491.The_Enchanted_Wood
✅ INVITATION_TO_THE_WALTZ → 1477344.Invitation_to_the_Waltz
✅ THE_TEN_COMMANDMENTS → 271660.Talk_to_the_Snail
✅ THE_FOUNTAIN → 43220998-the-fountains-of-sil
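
Several of the matches above look wrong (for example, `SONS` resolved to `9520360-the-son-of-neptune` and `A_MODERN_HERO` to `931581.The_Boz`), because the lookup takes the first search hit unconditionally. One possible sanity check, sketched below, compares the searched title with the title part of the returned slug using `difflib` from the standard library. The helper names and the `0.8` cutoff are assumptions made for illustration and would need tuning against the real results.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and reduce to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def slug_title(slug):
    """Drop the leading numeric ID from a Goodreads slug, keeping the title part."""
    return re.sub(r"^\d+[.\-]", "", slug)

def match_score(query, slug):
    """Similarity in [0, 1] between the searched title and the slug's title."""
    return SequenceMatcher(None, normalize(query), normalize(slug_title(slug))).ratio()

# Pairs copied from the lookup output above
pairs = [
    ("MAID_IN_WAITING", "1729347.Maid_In_Waiting"),
    ("SONS", "9520360-the-son-of-neptune"),
    ("A_MODERN_HERO", "931581.The_Boz"),
]

THRESHOLD = 0.8  # assumed cutoff, not tuned on the full list
for query, slug in pairs:
    score = match_score(query, slug)
    flag = "ok" if score >= THRESHOLD else "CHECK"
    print(f"{flag:5} {score:.2f}  {query} -> {slug}")
```

Flagged pairs could then be reviewed by hand or retried with the author name added to the search query.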

## The Scraping Function

In [None]:
!pip install --quiet requests beautifulsoup4 lxml pandas

In [None]:
import random
import time

import requests
from bs4 import BeautifulSoup

def fetch_html(slug, retries=3, backoff=1.0, timeout=10.0):
    url = f"https://www.goodreads.com/book/show/{slug}"
    headers = {'User-Agent': 'Mozilla/5.0'}
    for attempt in range(1, retries+1):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt < retries:
                time.sleep(backoff * attempt)
            else:
                raise

def get_title(soup):
    node = soup.select_one('h1[data-testid="bookTitle"]')
    return node.get_text(strip=True) if node else ''

def get_author(soup):
    node = soup.select_one('span[data-testid="name"]')
    return node.get_text(strip=True) if node else ''

def get_avg_rating(soup):
    node = soup.select_one('span[itemprop="ratingValue"]')
    return node.get_text(strip=True) if node else ''

def get_genres(soup):
    items = [g.get_text(strip=True) for g in soup.select('span.Button__labelItem')]
    return ';'.join(items)

def get_ratings_count(soup):
    node = soup.select_one('span[data-testid="ratingsCount"]')
    return node.get_text(strip=True) if node else ''

def get_reviews_count(soup):
    node = soup.select_one('span[data-testid="reviewsCount"]')
    return node.get_text(strip=True) if node else ''

def scrape_book(slug):
    html = fetch_html(slug)
    soup = BeautifulSoup(html, 'lxml')
    time.sleep(1 + random.random())

    return {
        'slug':           slug,
        'title':          get_title(soup),
        'author':         get_author(soup),
        'avg_rating':     get_avg_rating(soup),
        'genres':         get_genres(soup),
        'ratings_count':  get_ratings_count(soup),
        'reviews_count':  get_reviews_count(soup),
    }
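
The cell above defines `scrape_book` but does not show how it is run over the whole slug list. A minimal driver might look like the sketch below, which writes one CSV row per book and collects the slugs that failed. `scrape_to_csv`, `FIELDS`, the demo filename, and the stand-in `fake_scrape` are hypothetical names introduced here so the example runs offline.

```python
import csv

# Keys returned by scrape_book in this notebook
FIELDS = ["slug", "title", "author", "avg_rating", "genres",
          "ratings_count", "reviews_count"]

def scrape_to_csv(slugs, out_path, scrape_fn):
    """Scrape each slug with scrape_fn and write one CSV row per success.

    Returns the list of slugs that failed so they can be retried later.
    """
    failed = []
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for slug in slugs:
            try:
                writer.writerow(scrape_fn(slug))
            except Exception as exc:  # one bad page should not stop the run
                print(f"skipped {slug}: {exc}")
                failed.append(slug)
    return failed

def fake_scrape(slug):
    """Stand-in for scrape_book so the demo needs no network access."""
    if slug == "bad":
        raise ValueError("simulated failure")
    row = {field: "" for field in FIELDS}
    row["slug"] = slug
    return row

failed = scrape_to_csv(["1078.The_Good_Earth", "bad"], "books_demo.csv", fake_scrape)
print(failed)  # ['bad']
```

In this notebook the call would be something like `scrape_to_csv(goodreads_slugs, 'goodreads_books.csv', scrape_book)`, where the output filename is a placeholder.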

# Exploratory Data Analysis (EDA)
Describing the dataset:

### Temporal analysis:
* Number of unique books/authors per year
* Average weeks at #1 per year
* Longest consecutive weeks at #1



In [None]:
nyt_full['week'] = pd.to_datetime(nyt_full['week'])

# 1. Unique books/authors per year
books_per_year   = nyt_titles.groupby('year')['id'] \
                      .nunique().rename('unique_books')
authors_per_year = nyt_titles.groupby('year')['author'] \
                      .nunique().rename('unique_authors')

temporal_counts = pd.concat([books_per_year, authors_per_year], axis=1) \
                   .reset_index()
print("Unique books and authors per year:")
print(temporal_counts)

# 2. Weeks-at-#1 metrics (per book)
weeks_at_1 = (
    nyt_full[nyt_full['rank'] == 1]
    .groupby('title_id')
    .size()
    .rename('weeks_at_1')
)

# 3. Longest consecutive #1 streak (per book)
df_r1 = nyt_full[nyt_full['rank'] == 1].sort_values(['title_id','week'])

def max_run(week_series):
    diffs = week_series.diff().dt.days.fillna(7)
    runs  = (diffs != 7).cumsum()
    return week_series.groupby(runs).size().max()

streak1 = (
    df_r1
    .groupby('title_id')['week']   # explicitly select the week column
    .apply(max_run)                # only `week` is passed, so include_groups is not needed
    .rename('max_consec_1')
)

# 4. Merge back into nyt_titles
df_metrics = nyt_titles.merge(weeks_at_1, left_on='id', right_index=True, how='left')
df_metrics = df_metrics.merge(streak1,    left_on='id', right_index=True, how='left')
df_metrics[['weeks_at_1','max_consec_1']] = df_metrics[['weeks_at_1','max_consec_1']].fillna(0)

# 5. Compute per-year summaries of those metrics
avg_weeks_per_year      = df_metrics.groupby('year')['weeks_at_1'] \
                              .mean().rename('avg_weeks_at_1')
longest_streak_per_year = df_metrics.groupby('year')['max_consec_1'] \
                              .max().rename('longest_streak_at_1')

weeks_summary = pd.concat([avg_weeks_per_year, longest_streak_per_year], axis=1) \
                  .reset_index()
print("\nAverage & longest consecutive weeks at #1 per year:")
print(weeks_summary)

Unique books and authors per year:
    year  unique_books  unique_authors
0   1931            13              13
1   1932            45              43
2   1933            40              38
3   1934            73              66
4   1935            66              63
..   ...           ...             ...
85  2016           182             155
86  2017           197             168
87  2018           174             151
88  2019           186             164
89  2020           164             148

[90 rows x 3 columns]

Average & longest consecutive weeks at #1 per year:
    year  avg_weeks_at_1  longest_streak_at_1
0   1931        0.846154                  6.0
1   1932        1.133333                  9.0
2   1933        1.400000                 28.0
3   1934        0.671233                 17.0
4   1935        0.772727                  7.0
..   ...             ...                  ...
85  2016        0.280220                  4.0
86  2017        0.263959                  5.0
87  201
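
To see how the streak logic in `max_run` behaves, it can be run on a small hand-made series. The dates below are invented for illustration. The function body is the same as in the cell above.

```python
import pandas as pd

def max_run(week_series):
    """Longest run of weekly dates that are exactly 7 days apart."""
    diffs = week_series.diff().dt.days.fillna(7)   # first row has no predecessor
    runs = (diffs != 7).cumsum()                   # new run ID whenever the gap is not one week
    return week_series.groupby(runs).size().max()

weeks = pd.Series(pd.to_datetime([
    "2020-01-05", "2020-01-12", "2020-01-19",   # three consecutive weeks
    "2020-02-09", "2020-02-16",                 # three-week gap, then two more
]))
print(max_run(weeks))  # 3
```

The gap of 21 days starts a new run, so the two runs have lengths 3 and 2 and the function returns 3.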

### Author analysis:
* Frequency of recurring authors
* Authors dominating specific periods

In [None]:
author_freq_1 = (
    nyt_titles[nyt_titles['best_rank'] == 1]['author']
    .value_counts()
    .rename_axis('author')
    .reset_index(name='num_#1_titles')
)

print("Top 10 authors by number of #1 bestsellers:")
display(author_freq_1.head(10))

Top 10 authors by number of #1 bestsellers:


   author                             num_#1_titles
0  Stephen King                       40
1  Danielle Steel                     35
2  John Grisham                       34
3  Janet Evanovich                    27
4  David Baldacci                     21
5  Mary Higgins Clark                 21
6  James Patterson                    19
7  James Patterson and Maxine Paetro  17
8  Patricia Cornwell                  16
9  Nora Roberts                       15
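
The frequency table covers recurring authors, but not the second bullet, authors dominating specific periods. One possible approach, sketched below, buckets #1 titles by decade and keeps the author with the most titles in each bucket. It assumes the `year`, `author`, and `best_rank` columns used elsewhere in this notebook. `top_author_by_decade` and the toy data are illustrative only.

```python
import pandas as pd

def top_author_by_decade(titles):
    """For each decade, return the author with the most #1 titles."""
    no1 = titles[titles["best_rank"] == 1].copy()
    no1["decade"] = (no1["year"] // 10) * 10
    counts = (
        no1.groupby(["decade", "author"])
        .size()
        .rename("num_no1_titles")
        .reset_index()
    )
    # highest count first within each decade; ties are broken alphabetically
    counts = counts.sort_values(
        ["decade", "num_no1_titles", "author"],
        ascending=[True, False, True],
    )
    return counts.drop_duplicates("decade").reset_index(drop=True)

# Toy data with placeholder author names
demo = pd.DataFrame({
    "year":      [1991, 1993, 1995, 2001, 2004, 2005],
    "author":    ["A", "A", "B", "B", "C", "C"],
    "best_rank": [1, 1, 1, 1, 1, 2],
})
print(top_author_by_decade(demo))
```

On the real data this would be called as `top_author_by_decade(nyt_titles)`. Note that co-authored entries such as "James Patterson and Maxine Paetro" are counted as a separate author string, as in the table above.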

