## Step 1: Imports & Setup

In this step, we import our custom functions and configuration file to ensure the environment is ready.  
We will use the same utility functions (`fetch_html`, `parse_html`, `extract_books_from_soup`, etc.)  
from the `functions.py` file located in the `notebooks` folder.


In [None]:
# ============================================================
# Step 1: Imports & Setup
# ============================================================

import sys
from pathlib import Path
import importlib

# --- Add notebooks path and import functions ---
sys.path.append("notebooks")

import functions

# --- Reload first so the names imported below reflect any recent edits ---
importlib.reload(functions)
from functions import load_config, ensure_directories

# --- Load config from project root ---
config_path = Path("../config.yaml")
config = load_config(config_path)

# --- Ensure directories exist (defined in config.yaml) ---
ensure_directories(config["paths"])

print("✅ All functions and configuration loaded successfully!")


## Step 2: Scrape Goodreads Pages 6–12

In this step, we continue the web scraping process for the Goodreads list  
**"Best Books Ever"**, covering pages **6 to 12** to reach a total of roughly 1000 books.  

We will use the same helper functions from `functions.py`:
- `fetch_html()` to retrieve page content  
- `parse_html()` to parse it with BeautifulSoup  
- `extract_books_from_soup()` to extract titles, authors, and metadata  

Each page is scraped with a small random delay between requests (ethical scraping).  
The final result is saved as `data/raw/goodreads_books_6_to_12.csv`.
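
The request pacing described above can be sketched in isolation. This is a minimal sketch, not the notebook's actual helpers: `build_page_urls` and `polite_sleep` are hypothetical names introduced here, and the 1 to 2 second range simply mirrors the delay used in the scraping cell.

```python
import random
import time

BASE_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page="


def build_page_urls(first_page, last_page, base_url=BASE_URL):
    """Return one URL per list page, inclusive of both ends (hypothetical helper)."""
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


def polite_sleep(min_seconds=1.0, max_seconds=2.0, sleeper=time.sleep):
    """Pause for a random duration so requests are not sent at a fixed rate."""
    delay = random.uniform(min_seconds, max_seconds)
    sleeper(delay)
    return delay


urls = build_page_urls(6, 12)
print(len(urls))  # pages 6, 7, 8, 9, 10, 11, 12
print(urls[0])
```

Passing `sleeper` as a parameter is a design choice for this sketch only: it lets the delay logic be exercised without actually waiting.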


In [None]:
# ============================================================
# Step 2: Scrape Goodreads Pages 6–12 (Extended Range)
# ============================================================

from functions import fetch_html, parse_html, extract_books_from_soup, save_dataset
from time import sleep
import random
import pandas as pd
from pathlib import Path

print("🔹 Starting scraping for pages 6–12...")

base_url = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page="
all_books_part2 = []

# --- Loop through pages 6 to 12 ---
for page in range(6, 13):
    url = base_url + str(page)
    print(f"\n🌍 Fetching page {page}: {url}")

    html = fetch_html(url)
    if html is None:
        print(f"⚠️ Skipping page {page} due to empty response.")
        continue

    soup = parse_html(html)
    books_page = extract_books_from_soup(soup)
    all_books_part2.append(books_page)

    # --- Ethical delay ---
    sleep(random.uniform(1, 2))

print(f"\n✅ Scraping completed for {len(all_books_part2)} pages.")

# --- Combine all new DataFrames (pd.concat raises ValueError on an empty list) ---
if not all_books_part2:
    raise RuntimeError("No pages were scraped; check the connection and fetch_html().")
df_part2 = pd.concat(all_books_part2, ignore_index=True)
print(f"✅ Combined dataset shape (pages 6–12): {df_part2.shape}")

# --- Save dataset correctly in the root-level data/raw ---
raw_output_path = Path("..") / config["paths"]["data_raw"] / "goodreads_books_6_to_12.csv"
print(f"📁 Saving file to: {raw_output_path.resolve()}")

save_dataset(df_part2, raw_output_path)

# --- Preview ---
df_part2.head()



## Step 3: Combine New Books (Pages 6–12) with Existing Clean Dataset

In this step, we combine the newly scraped books from pages **6–12**  
with the previously cleaned dataset (`books_clean.csv`) generated in Notebook 01.  

This step saves no additional raw CSV; it only updates the cleaned dataset,  
ensuring consistent columns (`title`, `author`, `avg_rating`, `genre`, `price`, etc.)  
and removing any duplicates by title.
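
The column alignment and title-based deduplication described above can be illustrated on toy data. This is a minimal sketch under assumptions: the rows and the extra `rank` column are made up, and `reindex` is used here as a compact alternative to the explicit loop in the cell below, not as the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for the clean dataset and the newly scraped pages (made-up rows).
df_existing = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "author": ["Author 1", "Author 2"],
    "genre": ["Fantasy", "Classics"],
})
df_new = pd.DataFrame({
    "title": ["Book B", "Book C"],
    "author": ["Author 2", "Author 3"],
    "rank": [101, 102],  # extra column that the clean dataset lacks
})

# reindex adds missing columns as NaN and drops extras in one call.
df_new_aligned = df_new.reindex(columns=df_existing.columns)

# Stack both frames, then keep the first row seen for each title.
df_combined = (
    pd.concat([df_existing, df_new_aligned], ignore_index=True)
    .drop_duplicates(subset=["title"], keep="first")
    .reset_index(drop=True)
)
print(df_combined)
```

Because `keep="first"` retains the earlier row, listing the existing dataset first means its values win when a title appears in both sources.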


In [None]:
import pandas as pd
from pathlib import Path
from functions import save_dataset

# --- Load existing clean dataset ---
clean_path_prev = Path("..") / config["paths"]["data_clean"] / "books_clean.csv"
df_existing = pd.read_csv(clean_path_prev)
print(f"✅ Loaded previous clean dataset: {df_existing.shape}")

# --- Define expected columns from the clean dataset ---
expected_cols = df_existing.columns.tolist()

# --- Create any missing columns in df_part2 ---
for col in expected_cols:
    if col not in df_part2.columns:
        df_part2[col] = pd.NA

# --- Ensure only expected columns are kept (ignore extras safely) ---
df_part2 = df_part2[[col for col in expected_cols if col in df_part2.columns]]

# --- Combine both datasets ---
df_combined = pd.concat(
    [df_existing, df_part2.dropna(how="all")],
    ignore_index=True
)

# --- Drop duplicates by title ---
df_combined = df_combined.drop_duplicates(subset=["title"]).reset_index(drop=True)

print(f"✅ Combined dataset shape: {df_combined.shape}")
print(f"Unique authors: {df_combined['author'].nunique()}")

# ============================================================
# 🧹 Handle Missing Values Before Saving
# ============================================================

# --- Fill missing genre and price before saving ---
df_combined["genre"] = df_combined["genre"].fillna("Unknown")

# --- Fill missing price with median (robust against outliers) ---
median_price = df_combined["price"].median()
df_combined["price"] = df_combined["price"].fillna(median_price)

print(f"Filled missing 'price' values with median: {median_price:.2f}")
print("Filled missing 'genre' with 'Unknown'")

# ============================================================
# 💾 Save the Updated Clean Dataset
# ============================================================

output_path = Path("..") / config["paths"]["data_clean"] / "books_clean_1000.csv"
print(f"Saving combined clean dataset to: {output_path.resolve()}")

save_dataset(df_combined, output_path)

# --- Quick preview ---
print("\n📘 Preview of combined dataset:")
display(df_combined.head(5))

# Step 4: Quick Data Verification

In this step, we will perform a quick verification of the combined dataset  
(`books_clean_1000.csv`) to ensure data integrity after merging the two sources.  

We will check:
- Dataset dimensions and column names  
- Missing values per column  
- Unique authors  
- Basic descriptive statistics (for numeric columns)


In [None]:
# ============================================================
# Step 4: Quick Data Verification
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load combined dataset ---
combined_path = Path("..") / config["paths"]["data_clean"] / "books_clean_1000.csv"
df = pd.read_csv(combined_path)

print(f"✅ Dataset loaded successfully: {combined_path}")
print(f"Shape: {df.shape}\n")

# --- Overview of columns ---
print("📊 Columns:")
print(df.columns.tolist())

# --- Quick info and missing values ---
print("\n🔍 DataFrame Info:")
df.info()

print("\n🔍 Missing values per column:")
print(df.isna().sum())

# --- Quick summary statistics (no datetime flag for older pandas) ---
print("\n📈 Descriptive Statistics:")
display(df.describe(include="all"))

# --- Check duplicates by title ---
dup_titles = df["title"].duplicated().sum()
print(f"\n⚠️ Duplicate titles found: {dup_titles}")

# --- Unique authors ---
unique_authors = df["author"].nunique()
print(f"👩‍💻 Unique authors: {unique_authors}")

# --- Example preview ---
print("\n📘 Preview of the combined dataset:")
display(df.head(10))


# Step 5: Enrich New Books (Pages 6–12) with Google Books API

In this step, we enrich the newly added books (pages **6–12**)  
with metadata from the **Google Books API**, following the same method used  
in Notebook 01.  

We will retrieve:
- Published year  
- Genre / categories  
- Cover URL  
- Price and currency (when available)  

Only books with a missing `avg_rating` are processed: these are the rows added from  
pages 6–12, so books already enriched in Notebook 01 trigger no duplicate API requests.
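
To make the lookup concrete, the following sketch shows how a search URL can be built with the title and author percent-encoded, and how the fields listed above can be read from a response. It is an illustration under assumptions: `build_query_url` and `parse_first_item` are hypothetical helper names, and the `sample` payload is hand-written for the example rather than real API output.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title, author):
    """Build a volumes search URL with the title and author percent-encoded."""
    query = f"intitle:{title} inauthor:{author}"
    return f"{API_URL}?{urlencode({'q': query})}"


def parse_first_item(payload):
    """Pull the fields this notebook uses out of the first search result, if any."""
    empty = {"published_year": None, "genre": None, "cover_url": None,
             "price": None, "currency": None}
    items = payload.get("items") or []
    if not items:
        return empty
    volume = items[0].get("volumeInfo", {})
    sale = items[0].get("saleInfo", {})
    price = sale.get("listPrice") or sale.get("retailPrice") or {}
    categories = volume.get("categories") or []
    return {
        "published_year": volume.get("publishedDate"),
        "genre": ", ".join(categories) or None,
        "cover_url": volume.get("imageLinks", {}).get("thumbnail"),
        "price": price.get("amount"),
        "currency": price.get("currencyCode"),
    }


# Hand-written payload for illustration only; not real API output.
sample = {"items": [{
    "volumeInfo": {"publishedDate": "1999", "categories": ["Fiction"]},
    "saleInfo": {"listPrice": {"amount": 9.99, "currencyCode": "USD"}},
}]}

print(build_query_url("Pride & Prejudice", "Jane Austen"))
print(parse_first_item(sample))
```

Encoding matters for titles that contain characters such as `&` or `#`, which would otherwise be read as URL syntax and truncate the query.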


In [None]:
# ============================================================
# Step 5: Reuse Google Books API functions
# ============================================================
import requests
from tqdm import tqdm
import time

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"


def get_book_info_from_google(title, author):
    """Query Google Books API and return metadata for a given title + author."""
    # Pass the query via params so requests percent-encodes spaces, '&', '#', etc.
    params = {"q": f"intitle:{title} inauthor:{author}"}

    try:
        response = requests.get(GOOGLE_BOOKS_URL, params=params, timeout=10)
        if response.status_code == 200:
            items = response.json().get("items") or []
            if items:
                info = items[0].get("volumeInfo", {})
                categories = info.get("categories") or []
                return {
                    # Raw publishedDate string as returned by the API
                    "published_year": info.get("publishedDate"),
                    "genre": ", ".join(categories) if categories else None,
                    "cover_url": info.get("imageLinks", {}).get("thumbnail"),
                }
    except Exception as e:
        print(f"⚠️ Error fetching '{title}': {e}")

    return {"published_year": None, "genre": None, "cover_url": None}


def get_price_from_google(title, author):
    """Query Google Books API for price info (listPrice or retailPrice)."""
    params = {"q": f"intitle:{title} inauthor:{author}"}

    try:
        response = requests.get(GOOGLE_BOOKS_URL, params=params, timeout=10)
        if response.status_code == 200:
            items = response.json().get("items") or []
            if items:
                sale = items[0].get("saleInfo", {})
                price_info = sale.get("listPrice") or sale.get("retailPrice")
                if price_info:
                    return price_info.get("amount"), price_info.get("currencyCode")
    except Exception as e:
        print(f"⚠️ Error fetching '{title}': {e}")
    return None, None


In [None]:
# ============================================================
# Step 5.1: Enrich only books missing metadata
# ============================================================

# --- Load combined dataset ---
from pathlib import Path
combined_path = Path("..") / config["paths"]["data_clean"] / "books_clean_1000.csv"
df = pd.read_csv(combined_path)

# --- Filter only the books that need enrichment ---
df_to_enrich = df[df["avg_rating"].isna()].copy()
print(f"📚 Books pending enrichment: {len(df_to_enrich)}")

# --- Apply Google Books API to retrieve metadata ---
results = []
for _, row in tqdm(df_to_enrich.iterrows(), total=len(df_to_enrich)):
    meta = get_book_info_from_google(row["title"], row["author"])
    results.append(meta)
    time.sleep(0.5)  # ethical delay

api_df = pd.DataFrame(results)
df_enriched = pd.concat([df_to_enrich.reset_index(drop=True), api_df], axis=1)

print(f"✅ Metadata enrichment completed for {len(df_enriched)} books.")
df_enriched.head()


In [None]:
df_enriched.to_csv("../data/raw/temp_books_meta_backup.csv", index=False, encoding="utf-8-sig")
print("💾 Backup saved successfully!")


In [None]:
df_enriched.isna().sum()


## Step 5.2: Price Enrichment (Optimized Version)

Retrieve book prices from Google Books API, using checkpointing to avoid
losing progress if interrupted. Merge results with previously enriched metadata
to build a fully enriched dataset.
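
The checkpointing idea can be shown without pandas or network access. This is a minimal sketch under assumptions: `load_done`, `run_with_checkpoint`, and `fake_lookup` are hypothetical names, the lookup returns a made-up value instead of calling the API, and results are appended to the file per batch (the cell below rewrites the whole checkpoint instead).

```python
import csv
import tempfile
from pathlib import Path


def load_done(checkpoint):
    """Return the set of titles already recorded in the checkpoint file."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(newline="", encoding="utf-8") as fh:
        return {row["title"] for row in csv.DictReader(fh)}


def run_with_checkpoint(titles, lookup, checkpoint, batch_size=2):
    """Process only unseen titles, appending results to disk every batch_size items."""
    done = load_done(checkpoint)
    pending = [t for t in titles if t not in done]
    buffer = []
    for i, title in enumerate(pending, start=1):
        buffer.append({"title": title, "price": lookup(title)})
        if i % batch_size == 0 or i == len(pending):
            write_header = not checkpoint.exists()
            with checkpoint.open("a", newline="", encoding="utf-8") as fh:
                writer = csv.DictWriter(fh, fieldnames=["title", "price"])
                if write_header:
                    writer.writeheader()
                writer.writerows(buffer)
            buffer = []  # reset the in-memory batch once it is on disk
    return len(pending)


def fake_lookup(title):
    """Stand-in for the API call; returns a made-up value."""
    return len(title)


checkpoint_file = Path(tempfile.mkdtemp()) / "prices_checkpoint.csv"
titles = ["Book A", "Book B", "Book C"]
first_run = run_with_checkpoint(titles, fake_lookup, checkpoint_file)
second_run = run_with_checkpoint(titles, fake_lookup, checkpoint_file)
print(first_run, second_run)
```

The second call finds every title in the checkpoint and does no work, which is the property that makes an interrupted run safe to restart.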


In [None]:
# ============================================================
# Step 5.2: Price Enrichment (Optimized & Re-startable)
# ============================================================

import pandas as pd
import time
from tqdm import tqdm
from pathlib import Path

# --- Load the intermediate enriched file (from Step 5.1) ---
intermediate_path = Path("..") / config["paths"]["data_raw"] / "temp_books_meta_backup.csv"
df_enriched = pd.read_csv(intermediate_path)

print(f"✅ Loaded enriched dataset for price retrieval: {df_enriched.shape[0]} books")

# --- Define checkpoint path ---
checkpoint_path = Path("..") / config["paths"]["data_raw"] / "temp_prices_checkpoint.csv"

# --- If checkpoint exists, resume from there ---
if checkpoint_path.exists():
    df_checkpoint = pd.read_csv(checkpoint_path)
    processed_titles = set(df_checkpoint["title"].unique())
    print(f"⏩ Resuming from checkpoint ({len(df_checkpoint)} books already processed).")
else:
    df_checkpoint = pd.DataFrame(columns=["title", "price", "currency"])
    processed_titles = set()
    print("🆕 Starting fresh price enrichment.")

# --- Filter only books not yet processed ---
df_to_process = df_enriched[~df_enriched["title"].isin(processed_titles)].copy()
print(f"📚 Remaining books to process: {len(df_to_process)}")

# --- Apply Google Books API to get prices ---
prices = []

for i, (_, row) in enumerate(tqdm(df_to_process.iterrows(), total=len(df_to_process))):
    price, currency = get_price_from_google(row["title"], row["author"])
    prices.append({"title": row["title"], "price": price, "currency": currency})
    
    # --- Save progress every 50 books ---
    if (i + 1) % 50 == 0 or (i + 1) == len(df_to_process):
        df_partial = pd.DataFrame(prices)
        df_checkpoint = pd.concat([df_checkpoint, df_partial], ignore_index=True)
        df_checkpoint.to_csv(checkpoint_path, index=False, encoding="utf-8-sig")
        print(f"💾 Checkpoint saved ({len(df_checkpoint)} total so far)")
        prices = []  # reset buffer

    time.sleep(0.5)  # reduced ethical delay

print("\n✅ Price enrichment completed!")

# --- Merge checkpoint results into main enriched dataset ---
df_prices_final = pd.read_csv(checkpoint_path)

# --- Ensure columns exist even if no new prices were processed ---
if "price" not in df_prices_final.columns:
    df_prices_final["price"] = None
if "currency" not in df_prices_final.columns:
    df_prices_final["currency"] = None

df_final = pd.merge(df_enriched, df_prices_final, on="title", how="left")

# --- Save final enriched dataset ---
output_path = Path("..") / config["paths"]["data_clean"] / "books_clean_enriched_1000.csv"
df_final.to_csv(output_path, index=False, encoding="utf-8-sig")

print(f"\n💾 Final enriched dataset saved successfully → {output_path.resolve()}")
print(f"Rows: {len(df_final)}, Columns: {len(df_final.columns)}")


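# Step 6: Clean Duplicated Columns from Enriched Dataset

Before merging the old and new datasets, we remove the duplicate columns created during enrichment.
Columns with a `.1` suffix (produced when the metadata backup CSV is reloaded) and columns with a `_y`
suffix (produced by the price merge) contain the enriched data pulled from the Google Books API;
the unsuffixed and `_x` versions hold the earlier values.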
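
The origin of the suffixes can be reproduced on a one-row example. This is a minimal sketch with made-up values: a side-by-side `concat` keeps two columns named `genre`, reloading the CSV renames the second one to `genre.1`, and a merge with an overlapping `price` column produces `price_x` and `price_y`.

```python
import io

import pandas as pd

# Made-up rows: 'genre' exists before enrichment and again in the API result.
before = pd.DataFrame({"title": ["Book A"], "genre": [None], "price": [12.0]})
api_meta = pd.DataFrame({"genre": ["Fiction"]})
api_prices = pd.DataFrame({"title": ["Book A"], "price": [9.99]})

# 1) Side-by-side concat keeps both 'genre' columns under the same name.
side_by_side = pd.concat([before, api_meta], axis=1)

# 2) Writing and re-reading the CSV renames the second duplicate to 'genre.1'.
buffer = io.StringIO()
side_by_side.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# 3) Merging on 'title' with an overlapping 'price' column yields _x / _y.
merged = reloaded.merge(api_prices, on="title", how="left")
print(merged.columns.tolist())
```

Knowing which operation produced which suffix makes it easier to decide which version of each column to keep in the cleanup cell.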


In [None]:
# ============================================================
# Step 6: Final Cleanup (Keep only API-enhanced columns)
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load dataset (enriched) ---
clean_path = Path("..") / config["paths"]["data_clean"] / "books_clean_enriched_1000.csv"
df = pd.read_csv(clean_path)
print(f"✅ Loaded dataset: {df.shape}")

# --- Columns to keep/replace manually based on enrichment ---
# Keep only the API-enriched versions of key columns
if "genre.1" in df.columns:
    df["genre"] = df["genre.1"]

if "published_year.1" in df.columns:
    df["published_year"] = df["published_year.1"]

if "price_y" in df.columns:
    df["price"] = df["price_y"]

if "currency_y" in df.columns:
    df["currency"] = df["currency_y"]

if "cover_url.1" in df.columns:
    df["cover_url"] = df["cover_url.1"]

# --- Drop unwanted duplicates (keep 'cover_url', which now holds the API value) ---
cols_to_drop = [
    "genre.1", "published_year.1",
    "price_x", "price_y",
    "currency_x", "currency_y",
    "cover_url.1"
]
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns], errors="ignore")

# --- Reorder columns (keep avg_rating) ---
cols_final = [
    "title", "author", "avg_rating", "genre",
    "published_year", "price", "currency", "cover_url", "link"
]
df = df[[c for c in cols_final if c in df.columns]]

# --- Save safely under new name ---
final_path = Path("..") / config["paths"]["data_clean"] / "books_clean_enriched_final.csv"
df.to_csv(final_path, index=False, encoding="utf-8-sig")

print(f"💾 Cleaned dataset saved safely → {final_path.resolve()}")
print(f"✅ Final shape: {df.shape}")
display(df.head(10))


## Step 6.1: Recalculate and Populate `avg_rating`

In this step, we ensure that the `avg_rating` column in the newly scraped dataset  
(`books_clean_enriched_1000.csv`) is complete and consistent with the first dataset  
(`books_clean.csv`).

Since both datasets represent books from the same Goodreads list,  
we use the ratings from the first dataset as a reference.  
If a book title exists in both datasets, we copy its `avg_rating`.  
If it doesn't exist, we assign the global average rating (≈ 4.1).

This guarantees that all books have a valid numerical rating value  
before merging both datasets in the next step.
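
The two-stage fallback described above can be sketched on toy data. This is an illustration under assumptions: the ratings are made up, and a title-to-rating `map` is used here instead of the merge in the cell below, since a lookup cannot add rows when a title repeats.

```python
import pandas as pd

# Made-up ratings for illustration.
df_base = pd.DataFrame({"title": ["Book A", "Book B"], "avg_rating": [4.0, 4.4]})
df_new = pd.DataFrame({
    "title": ["Book A", "Book C"],
    "avg_rating": [float("nan"), float("nan")],
})

# Title -> rating lookup; dropping duplicate titles keeps the mapping one-to-one.
rating_by_title = (
    df_base.drop_duplicates(subset="title").set_index("title")["avg_rating"]
)
global_avg = df_base["avg_rating"].mean()

df_new["avg_rating"] = (
    df_new["avg_rating"]
    .fillna(df_new["title"].map(rating_by_title))  # 1) exact title match
    .fillna(global_avg)                            # 2) global-average fallback
)
print(df_new)
```

In this example "Book A" receives its rating from the base dataset, while "Book C", which has no match, receives the mean of the base ratings.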



In [None]:
# ============================================================
# Step 6.1: Recalculate and Populate avg_rating
# ============================================================

import pandas as pd
from pathlib import Path

# --- Paths ---
base_path = Path("..") / "data" / "clean" / "books_clean.csv"
enriched_path = Path("..") / "data" / "clean" / "books_clean_enriched_1000.csv"

# --- Load datasets ---
df_base = pd.read_csv(base_path)
df_enriched = pd.read_csv(enriched_path)

print(f"📘 Base dataset: {df_base.shape}")
print(f"💎 Enriched dataset: {df_enriched.shape}")

# --- Compute global average rating for fallback ---
global_avg = df_base["avg_rating"].mean()
print(f"🌍 Global average rating: {global_avg:.2f}")

# --- Merge ratings by title (one rating per title, so the left merge cannot add rows) ---
df_enriched = df_enriched.merge(
    df_base[["title", "avg_rating"]].drop_duplicates(subset="title"),
    on="title",
    how="left",
    suffixes=("", "_base")
)

# --- Fill missing ratings ---
df_enriched["avg_rating"] = df_enriched["avg_rating"].fillna(df_enriched["avg_rating_base"])
df_enriched["avg_rating"] = df_enriched["avg_rating"].fillna(global_avg)

# --- Drop helper column ---
df_enriched = df_enriched.drop(columns=["avg_rating_base"], errors="ignore")

# --- Save intermediate result ---
df_enriched.to_csv(enriched_path, index=False, encoding="utf-8-sig")

print(f"✅ Ratings populated and saved → {enriched_path.resolve()}")


In [None]:
# ============================================================
# Step 6.2: Final Cleanup (Preserve all URLs)
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load dataset (after ratings filled) ---
clean_path = Path("..") / "data" / "clean" / "books_clean_enriched_1000.csv"
df = pd.read_csv(clean_path)
print(f"✅ Loaded dataset: {df.shape}")

# --- Replace columns with API-enriched versions ---
if "genre.1" in df.columns:
    df["genre"] = df["genre.1"]

if "published_year.1" in df.columns:
    df["published_year"] = df["published_year.1"]

if "price_y" in df.columns:
    df["price"] = df["price_y"]

if "currency_y" in df.columns:
    df["currency"] = df["currency_y"]

if "cover_url.1" in df.columns:
    df["cover_url"] = df["cover_url.1"]

# --- Drop only redundant duplicates ---
cols_to_drop = [
    "genre.1", "published_year.1",
    "price_x", "price_y",
    "currency_x", "currency_y",
    "cover_url.1"
]
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns], errors="ignore")

# --- Keep both 'url' and 'link' if exist ---
url_cols = [c for c in ["url", "link"] if c in df.columns]

# --- Reorder columns (preserving both URLs) ---
cols_final = [
    "title", "author", "avg_rating", "genre",
    "published_year", "price", "currency",
    "cover_url", *url_cols
]
df = df[[c for c in cols_final if c in df.columns]]

# --- Save final clean dataset ---
final_path = Path("..") / "data" / "clean" / "books_clean_enriched_1000.csv"
df.to_csv(final_path, index=False, encoding="utf-8-sig")

print(f"💾 Cleaned dataset saved successfully → {final_path.resolve()}")
print(f"✅ Final shape: {df.shape}")
display(df.head(10))


# Step 7: Merge Final Dataset (1000 Books)

Now that both datasets are fully cleaned and enriched,  
we combine them into a single master dataset containing around 1000 books.

This step:
- Loads the first dataset (`books_clean.csv`), covering pages 1–5  
- Loads the new dataset (`books_clean_enriched_1000.csv`), covering pages 6–12  
- Merges both, removes duplicates by title, and ensures column consistency  
- Saves the final version as `books_final_1000.csv` in the `/data/clean` folder

The resulting dataset will be the input for the **Exploratory Data Analysis** phase in the next notebook.


In [None]:
# ============================================================
# Step 7: Merge Final Dataset (1000 Books)
# ============================================================

import pandas as pd
from pathlib import Path

# --- Paths ---
base_path = Path("..") / config["paths"]["data_clean"]
path_part1 = base_path / "books_clean.csv"                 # Dataset from pages 1–5
path_part2 = base_path / "books_clean_enriched_1000.csv"   # Dataset from pages 6–12
path_final = base_path / "books_final_1000.csv"            # Output file

# --- Load datasets ---
df_part1 = pd.read_csv(path_part1)
df_part2 = pd.read_csv(path_part2)

print(f"📘 First dataset: {df_part1.shape}")
print(f"📗 Second dataset: {df_part2.shape}")

# --- Standardize columns ---
common_cols = [c for c in df_part1.columns if c in df_part2.columns]
df_part1 = df_part1[common_cols]
df_part2 = df_part2[common_cols]

# --- Combine and clean ---
df_final = pd.concat([df_part1, df_part2], ignore_index=True)
df_final.drop_duplicates(subset=["title"], inplace=True)

print(f"✅ Combined dataset shape: {df_final.shape}")
print(f"Unique authors: {df_final['author'].nunique()}")

# --- Quick sanity check ---
missing = df_final.isna().sum()
print("\n🔍 Missing values summary:")
print(missing[missing > 0])

# --- Save final dataset ---
df_final.to_csv(path_final, index=False, encoding="utf-8-sig")

print(f"\n💾 Final dataset saved successfully → {path_final.resolve()}")
print(df_final.head(10))


## Step 7.1: Standardize and Format Final Dataset

Before moving on to the modeling phase, we will perform a **final cleaning and standardization** step to ensure consistency across all fields.

This includes:
- Rounding all numerical values (e.g. `avg_rating`, `price`) to **two decimals**
- Stripping extra spaces from text columns
- Standardizing capitalization for `genre` and `currency`
- Sorting data alphabetically by `title`
- Improving **visual alignment** (text to the left, numbers centered) for readability in Jupyter

The cleaned dataset will overwrite the existing `books_final_1000.csv` in `data/clean/`.
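
One detail of the text cleanup is worth illustrating: how missing values behave under string operations. Depending on the pandas version, converting a column with `astype(str)` can turn a missing entry into the literal text `nan`, which `str.title()` then changes to `Nan`. The sketch below, on made-up data, uses the nullable `"string"` dtype so that missing entries stay missing.

```python
import pandas as pd

# Made-up rows with one missing genre.
df = pd.DataFrame({"genre": ["  science fiction ", float("nan")]})

# The nullable "string" dtype keeps the missing entry missing through
# strip() and title(), so it is still counted by isna() afterwards.
cleaned = df["genre"].astype("string").str.strip().str.title()

print(cleaned.tolist())
print(int(cleaned.isna().sum()))
```

Keeping missing values as missing means the per-column counts from `isna().sum()` remain accurate after standardization.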

In [None]:
# ============================================================
# Step 7.1: Standardize and Format Final Dataset
# ============================================================

import pandas as pd
from pathlib import Path

# --- Load final dataset ---
final_path = Path("..") / config["paths"]["data_clean"] / "books_final_1000.csv"
df = pd.read_csv(final_path)
print(f"✅ Loaded dataset: {df.shape}")

# --- Round numeric columns to 2 decimals ---
for col in ["avg_rating", "price"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce").round(2)

# --- Enforce numeric display format (2 decimals in Jupyter) ---
pd.options.display.float_format = "{:.2f}".format

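# --- Clean text columns (nullable "string" dtype keeps missing values missing
# --- instead of turning them into the literal text "nan") ---
for col in ["title", "author", "genre", "currency"]:
    if col in df.columns:
        df[col] = df[col].astype("string").str.strip()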

# --- Standardize capitalization ---
if "genre" in df.columns:
    df["genre"] = df["genre"].str.title()
if "currency" in df.columns:
    df["currency"] = df["currency"].str.upper()

# --- Sort alphabetically by title ---
df = df.sort_values("title").reset_index(drop=True)

# --- Save standardized dataset ---
df.to_csv(final_path, index=False, encoding="utf-8-sig")
print(f"💾 Final standardized dataset saved → {final_path.resolve()}")

# --- Load the final standardized dataset ---
final_path = Path("..") / config["paths"]["data_clean"] / "books_final_1000.csv"
df = pd.read_csv(final_path)
print(f"✅ Loaded dataset for display: {df.shape}")

# --- Jupyter display settings ---
pd.set_option("display.max_colwidth", 120)
pd.set_option("display.float_format", "{:.2f}".format)

# --- Define column alignment (only columns that exist, so Styler subsets stay valid) ---
left_cols = [c for c in ["title", "author", "genre", "cover_url", "link"] if c in df.columns]
center_cols = [c for c in df.columns if c not in left_cols]

# --- Apply notebook-only style ---
styled_df = (
    df.head(20)
    .style
    .set_properties(subset=left_cols, **{"text-align": "left"})
    .set_properties(subset=center_cols, **{"text-align": "center"})
    .set_table_styles(
        [{"selector": "th", "props": [("text-align", "center")]}]  # center headers
    )
)

print("\n🪄 Preview: text aligned left, numbers centered:")
display(styled_df)

# Step 8: Final Summary and Transition

Before moving on to feature extraction and clustering, let's summarize the full pipeline executed in this notebook.

We have combined both enriched datasets (Part 1 and Part 2), standardized them, and validated the integrity of the final dataset.

---

### 📊 Summary of the Data Merge

**First dataset:** (493, 9)  
**Second dataset:** (697, 8)  
✅ **Combined dataset shape:** (1190, 8)  
👩‍💻 **Unique authors:** 714  

**Missing values summary:**

| Column | Missing Values |
|:--|--:|
| genre | 149 |
| price | 636 |
| currency | 636 |
| cover_url | 118 |

---

💾 **Final dataset saved successfully →**

data/clean/books_final_1000.csv

In [None]:
print(f"Total unique books: {df_final['title'].nunique()}")
