# Data Enrichment & Dataset Integration

## Objectives

The purpose of this notebook is to **enrich, align, and integrate the cleaned datasets** to create a unified analytical foundation for modelling book satisfaction and evaluating catalogue diversity.

This notebook expands upon prior cleaning work by **adding missing metadata, linking overlapping records across datasets, filtering the dataset to English-language titles, and preparing a model-ready dataset** that combines catalog-level information (BBE) with user-behavioral data (Goodbooks).

Ultimately, this notebook enables insights that neither dataset could provide independently, most critically, **genre diversity analysis**, **language-based consistency**, **metadata-enhanced prediction modeling**.

---

## Inputs

| Dataset                             | Source                     | Description                                                                                         | Format |
| ----------------------------------- | -------------------------- | --------------------------------------------------------------------------------------------------- | ------ |
| `bbe_clean_v13.csv`                  | Output from Notebook 02    | Cleaned *Best Books Ever* metadata including title, authors, genres, rating, description, and more. | CSV    |
| `books_clean_v7.csv`      | Output from Notebook 02    | Cleaned Goodbooks-10k metadata lacking genre data but containing structural identifiers.            | CSV    |
| `ratings_clean_v1.csv`    | Output from Notebook 02    | User–book interaction and aggregated rating data for behavioral modeling.                           | CSV    |
| *(Optional)* External API responses | OpenLibrary / Google Books | Supplemental metadata (genres, languages, subjects) for non-overlapping titles.                     | JSON   |

---

## Tasks in This Notebook

This notebook will execute the following enrichment and integration steps:

1. **Standardize linking identifiers**
   Normalize `isbn_clean`, `goodreads_id`, `title_clean`, and `author_clean` across datasets to ensure reliable cross-dataset merging.

2. **Identify overlap between BBE and Goodbooks**
   Detect books present in both datasets using multi-key matching and evaluate match quality.

3. **Enrich Goodbooks metadata with missing genres**

   * Use BBE genre fields for overlapping titles.
   * Query external APIs for non-overlapping titles.
   * Normalize all genre outputs into a unified taxonomy.

4. **Complete and standardize language metadata**
   Fill missing values using BBE, APIs, or text-based heuristics, then harmonize language labels and codes.

5. **Filter the enriched datasets to English-language books**
   Restrict the unified dataset to titles identified as **English-language**, ensuring consistency for:

   * genre diversity comparisons
   * ratings behavior
   * regression modeling

   *(Non-English titles will be kept only in the enriched BBE/Goodbooks outputs, but excluded from the model dataset.)*

6. **Integrate datasets into a model-ready schemas**
   Combine BBE metadata with Goodbooks behavioral features for all overlapping **English-language** books.

7. **Validate enrichment and filtering results**

   * Assess metadata fields fill rates
   * Review API match and success metrics
   * Log all imputation and filtering decisions for reproducibility

8. **Export enriched and unified datasets**
   Produce final English-filtered datasets ready for modeling and analysis.

---

## Outputs

* **en_supply_catalog.csv** — enriched metadata for all BBE books, representing the supply catalog
* **en_internal_catalog.csv** — enriched metadata for all Goodbooks books, representing the internal catalog
* **model_dataset_warm_start.csv** — unified metadata + behavioral dataset filtered to English-language books. Includes external BBE signals for cross platform validation.
* **model_dataset_cold_start.csv** — unified metadata + behavioral dataset filtered to English-language books. Excludes external BBE signals for pure internal modeling.
* **Enrichment and filtering logs** — documenting imputation sources, API usage, and filtering decisions

> **Note:** This notebook focuses on **metadata enrichment, English-language filtering, and dataset integration**. Model development and feature engineering will be performed in later notebooks.

# Set up

## Navigate to the Parent Directory

Before combining and saving datasets, it’s often helpful to move to a parent directory so that file operations (like loading or saving data) are easier and more organized. 

Before using the Python’s built-in os module to move one level up from the current working directory, it is advisable to inspect the current directory.

In [1]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

Current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics\notebooks


To change to parent directory (root folder), run the code below. If you are already in the root folder, you can skip this step.

In [2]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

Changed directory to parent.
New current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics


In [3]:
import numpy as np
import pandas as pd

from src.cleaning.utils.categories import (
    map_subjects_to_genres
)
from src.cleaning.utils.pipeline import apply_cleaners_selectively

## Load and Inspect Datasets

In this step, we load the previously cleaned datasets: **Goodbooks-10k** (books, ratings) and **Best Books Ever**. 

In [4]:
import pandas as pd 

# load datasets
books_clean = pd.read_csv(
    'data/interim/goodbooks/books_clean_v7.csv',
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
    )
ratings_clean = pd.read_csv('data/interim/goodbooks/ratings_clean_v0.csv')
bbe_clean = pd.read_csv(
    "data/interim/bbe/bbe_clean_v13.csv",
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
)

# create copies for imputation
books_impute = books_clean.copy()
bbe_impute = bbe_clean.copy()

# log samples
print("BBE dataset columns:")
print(bbe_impute.columns.tolist())
print("BBE dataset info:")
display(bbe_impute.info())
print("BBE dataset sample:")
display(bbe_impute.head(3))

print("Books dataset columns:")
print(books_impute.columns.tolist())
print("Books dataset info:")
display(books_impute.info())
print("Books dataset sample:")
display(books_impute.head(3))

BBE dataset columns:
['goodreads_id_clean', 'authors_list', 'author_clean', 'title_clean', 'isbn_clean', 'language_clean', 'publication_date_clean', 'publisher_clean', 'is_major_publisher', 'bookFormat_clean', 'rating_clean', 'numRatings_clean', 'numRatings_log', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'ratings_1_share', 'ratings_2_share', 'ratings_3_share', 'ratings_4_share', 'ratings_5_share', 'has_award', 'genres_clean', 'genres_simplified', 'description_clean', 'description_nlp', 'series_clean', 'pages_clean', 'bbeVotes_clean', 'bbeScore_clean', 'likedPercent_clean', 'has_likedPercent', 'price_clean', 'price_flag']
BBE dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52424 entries, 0 to 52423
Data columns (total 36 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   goodreads_id_clean      52424 non-null  string 
 1   authors_list            52424 non-null  object 
 2   auth

None

BBE dataset sample:


Unnamed: 0,goodreads_id_clean,authors_list,author_clean,title_clean,isbn_clean,language_clean,publication_date_clean,publisher_clean,is_major_publisher,bookFormat_clean,...,description_clean,description_nlp,series_clean,pages_clean,bbeVotes_clean,bbeScore_clean,likedPercent_clean,has_likedPercent,price_clean,price_flag
0,2767052,['suzanne collins'],suzanne collins,the hunger games,9780439023481.0,en,2008-09-14,scholastic,True,hardcover,...,winning means fame and fortunelosing means cer...,winning means fame and fortunelosing means cer...,the hunger games,374.0,30516,2993816,96.0,1,5.09,False
1,2,"['jk rowling', 'mary grandpre']","jk rowling, mary grandpre",harry potter and the order of the phoenix,9780439358071.0,en,2003-06-21,scholastic,True,paperback,...,there is a door at the end of a silent corrido...,there is a door at the end of a silent corrido...,harry potter,870.0,26923,2632233,98.0,1,7.38,False
2,2657,['harper lee'],harper lee,to kill a mockingbird,,en,2007-07-11,harpercollins,True,paperback,...,the unforgettable novel of a childhood in a sl...,the unforgettable novel of a childhood in a sl...,to kill a mockingbird,324.0,23328,2269402,95.0,1,,True


Books dataset columns:
['book_id', 'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'goodreads_id_clean', 'best_book_id_clean', 'work_id_clean', 'authors_list', 'author_clean', 'language_clean', 'publication_date_clean', 'isbn_clean', 'isbn13_clean', 'isbn_standard', 'rating_clean', 'numRatings_clean', 'numRatings_log', 'ratings_1_share', 'ratings_2_share', 'ratings_3_share', 'ratings_4_share', 'ratings_5_share', 'work_text_reviews_log', 'series_clean', 'title_clean']
Books dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   book_id                  10000 non-null  int64  
 1   work_text_reviews_count  10000 non-null  int64  
 2   ratings_1                10000 non-null  int64  
 3   ratings_2                10000 non-null  int64  
 4   ratings_3                10

None

Books dataset sample:


Unnamed: 0,book_id,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,goodreads_id_clean,best_book_id_clean,work_id_clean,...,numRatings_clean,numRatings_log,ratings_1_share,ratings_2_share,ratings_3_share,ratings_4_share,ratings_5_share,work_text_reviews_log,series_clean,title_clean
0,1,155254,66715,127936,560092,1481305,2706317,2767052,2767052,2792775,...,4942365,15.413355,0.013499,0.025886,0.113325,0.299716,0.547575,11.952824,the hunger games,the hunger games
1,2,75867,75504,101676,455024,1156318,3011543,3,3,4640799,...,4800065,15.38414,0.01573,0.021182,0.094795,0.240896,0.627396,11.23675,harry potter,harry potter and the sorcerer's stone
2,3,95009,456191,436802,793319,875073,1355439,41865,41865,3212258,...,3916824,15.180792,0.11647,0.111519,0.202541,0.223414,0.346056,11.461737,twilight,twilight


# Data Enrichment

## Enriching Goodbooks

### From BBE overlap

To improve the completeness and quality of the Goodbooks-10k dataset, we selectively merge in metadata from the Best Books Ever (BBE) dataset using the shared `goodreads_id_clean` key. Goodbooks is kept as the primary source, while BBE is used to supply additional metadata fields, such as genres and page counts, as well as to fill in missing values for shared attributes like ISBN, publication date, and series.

This approach ensures we enhance Goodbooks only where necessary: adding new information where it is absent and completing incomplete entries without overwriting existing data. The resulting `gb_enriched` dataset combines both sources into a more reliable and feature-rich foundation for downstream analytics and modeling.


In [5]:
# ---------------------------------------------
# ENRICH GOODBOOKS (books_impute) WITH BBE DATA
# ---------------------------------------------

import pandas as pd

# columns to enrich ONLY when GB has NaN
columns_to_enrich = [
    "publication_date_clean",
    "series_clean",
    "isbn_clean",
    "language_clean"
    ]

# columns existent only in BBE
bbe_only_columns = [
    "pages_clean",
    "genres_clean",
    "genres_simplified",
    "publisher_clean",
    "is_major_publisher",
    "has_award",
    "description_clean",
    "description_nlp"
]

# merge Goodbooks with the needed BBE columns
merge_cols = ["goodreads_id_clean"] + columns_to_enrich + bbe_only_columns

gb_enriched = books_impute.merge(
    bbe_impute[merge_cols].add_suffix("_bbe"),
    left_on="goodreads_id_clean",
    right_on="goodreads_id_clean_bbe",
    how="left"
)

# ---------------------------------------------
# ENRICH GENRE COLUMNS
# ---------------------------------------------
print("\n--- ENRICHING METADATA ---")
for col in bbe_only_columns:
    gb_enriched[col] = gb_enriched[col + "_bbe"]
    filled = gb_enriched[col].notna().sum()
    print(f"{col}: filled {filled} rows from BBE")

# ---------------------------------------------
# ENRICH SHARED COLUMNS ONLY WHERE GB IS NaN
# ---------------------------------------------
print("\n--- ENRICHING SHARED COLUMNS (GB NaN -> fill from BBE) ---")
for col in columns_to_enrich:
    before = gb_enriched[col].isna().sum()
    gb_enriched[col] = gb_enriched[col].fillna(gb_enriched[col + "_bbe"])
    after = gb_enriched[col].isna().sum()
    print(f"{col}: filled {before - after} missing values")

# ---------------------------------------------
# CLEANUP
# ---------------------------------------------
gb_enriched = gb_enriched.drop(columns=[c for c in gb_enriched.columns if c.endswith("_bbe")])

print("\nEnrichment complete!")
print("Final shape:", gb_enriched.shape)
gb_enriched[['isbn_clean','title_clean', 'series_clean', 'genres_clean', 'genres_simplified', 'pages_clean', 'publication_date_clean']].head()


--- ENRICHING METADATA ---
pages_clean: filled 8053 rows from BBE
genres_clean: filled 8082 rows from BBE
genres_simplified: filled 8082 rows from BBE
publisher_clean: filled 7954 rows from BBE
is_major_publisher: filled 8082 rows from BBE
has_award: filled 8082 rows from BBE
description_clean: filled 8009 rows from BBE
description_nlp: filled 8009 rows from BBE

--- ENRICHING SHARED COLUMNS (GB NaN -> fill from BBE) ---
publication_date_clean: filled 102 missing values
series_clean: filled 1133 missing values
isbn_clean: filled 984 missing values
language_clean: filled 684 missing values

Enrichment complete!
Final shape: (10000, 36)


Unnamed: 0,isbn_clean,title_clean,series_clean,genres_clean,genres_simplified,pages_clean,publication_date_clean
0,439023483.0,the hunger games,the hunger games,"['young adult', 'fiction', 'dystopia', 'fantas...","['young adult', 'fiction', 'dystopia', 'fantas...",374.0,2008-01-01
1,439554934.0,harry potter and the sorcerer's stone,harry potter,"['fantasy', 'fiction', 'young adult', 'magic',...","['fantasy', 'fiction', 'young adult', 'magic',...",309.0,1997-01-01
2,316015849.0,twilight,twilight,"['young adult', 'fantasy', 'romance', 'vampire...","['young adult', 'fantasy', 'romance', 'vampire...",501.0,2005-01-01
3,,to kill a mockingbird,to kill a mockingbird,"['classics', 'fiction', 'historical fiction', ...","['classics', 'fiction', 'historical fiction', ...",324.0,1960-01-01
4,743273567.0,the great gatsby,,"['classics', 'fiction', 'school', 'literature'...","['classics', 'fiction', 'school', 'literature'...",200.0,1925-01-01


In [6]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/enriched/")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 1

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v1 saved successfully in data/interim/merge directory.


### From external APIs

To further enrich the Goodbooks-10k dataset, we leverage external APIs such as OpenLibrary and Google Books to fill in missing metadata for titles not covered by the BBE overlap. This process involves querying these APIs using available identifiers (like ISBN or title/author combinations) to retrieve additional information such as genres, page counts, and publication details.

In [7]:
import re

def clean_isbn(isbn):
    if not isinstance(isbn, str):
        return None
    isbn = re.sub(r'[^0-9Xx]', '', isbn)
    if len(isbn) in [10, 13]:
        return isbn
    return None

gb_enriched['isbn_query'] = gb_enriched['isbn_clean'].apply(clean_isbn)

In [8]:
missing_mask = (
    gb_enriched['language_clean'].isna() |
    gb_enriched['language_clean'].isin(['unknown', '', 'None']) |
    gb_enriched['pages_clean'].isna() |
    gb_enriched['publication_date_clean'].isna()  |
    gb_enriched['publisher_clean'].isna() |
    gb_enriched['description_clean'].isna()
)

to_impute = gb_enriched[missing_mask].copy()
print("Books needing external enrichment:", len(to_impute))

Books needing external enrichment: 2219


#### Querying OpenLibrary API

After enriching Goodbooks with BBE overlap data, we identify **2,219** books still missing critical metadata (ISBN, language, pages, publication date, publisher). We query **OpenLibrary first** because it has no rate limits or API key requirements, making it ideal for bulk enrichment. We create a boolean mask to identify books needing enrichment, then query OpenLibrary's ISBN endpoint for each book, collecting results in a structured format.

The results are merged back into `gb_enriched` and saved as **version 2**. This incremental saving strategy ensures we don't lose progress if subsequent API calls fail or exceed quotas.

In [9]:
import json
from pathlib import Path

# cache path for OpenLibrary in data/raw
OL_CACHE_PATH = Path("data/raw/openlibrary_api_cache.json")

# create directory if it doesn't exist
OL_CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)

# load existing cache if it exists
if OL_CACHE_PATH.exists():
    with open(OL_CACHE_PATH, "r") as f:
        ol_cache = json.load(f)
    print(f"Loaded {len(ol_cache)} cached OpenLibrary entries")
else:
    ol_cache = {}
    print("No existing cache found, starting fresh")

Loaded 1720 cached OpenLibrary entries


You can skip the 2 cells below if you have already run the OpenLibrary queries and saved the results. Go to : _Run cached OpenLibrary queries to avoid re-querying the API, if you have previously saved the results._

In [None]:
import requests
import time

def query_openlibrary(isbn):
    """Return OL metadata in a consistent dict format."""

    isbn_str = str(isbn)
    
    if isbn_str in ol_cache:
        return ol_cache[isbn_str]
    
    # Default structure to guarantee stable DataFrame columns
    result = {
        "pages_openlib": None,
        "publication_date_openlib": None,
        "language_openlib": None,
        "subjects_openlib": None,
        "publisher_openlib": None, 
        "description_openlib": None, 
    }

    if isbn is None or pd.isna(isbn) or isbn == "":
        return result
    
    url = f"https://openlibrary.org/isbn/{isbn}.json"

    try:
        r = requests.get(url, timeout=10)
        time.sleep(0.2)

        if r.status_code != 200:
            return result

        data = r.json()

        # Pages
        result["pages_openlib"] = data.get("number_of_pages")

        # Publication date
        result["publication_date_openlib"] = data.get("publish_date")

        # Language
        if "languages" in data and isinstance(data["languages"], list):
            key = data["languages"][0].get("key", "").split("/")[-1]
            result["language_openlib"] = key

        # Subjects
        if "subjects" in data:
            result["subjects_openlib"] = [s.lower() for s in data["subjects"]]
        
        # Publisher
        if "publishers" in data and isinstance(data["publishers"], list):
            result["publisher_openlib"] = data["publishers"][0]
        
        # Description
        desc = data.get("description")
        if isinstance(desc, dict):
            result["description_openlib"] = desc.get("value")
        elif isinstance(desc, str):
            result["description_openlib"] = desc


    except Exception as e:
        pass  # keep the default result structure

    # Save to cache
    ol_cache[isbn_str] = result
    return result


In [None]:
import time
import json
from tqdm import tqdm
from pathlib import Path

results = []
for isbn in tqdm(to_impute['isbn_query'], desc="Querying OpenLibrary"):
    results.append(query_openlibrary(isbn))
    time.sleep(0.2)   # safe rate limit
    
# Save OpenLibrary cache after queries
with open(OL_CACHE_PATH, "w") as f:
    json.dump(ol_cache, f, indent=2)
print(f"OpenLibrary cache saved with {len(ol_cache)} entries")

**Run cached OpenLibrary queries to avoid re-querying the API, if you have previously saved the results.** Skip if you just ran the API queries above in the same session. 

In [10]:
# run this cell only if you want to use the cached results without querying again

results = []
for isbn in to_impute['isbn_query']:
    isbn_str = str(isbn) if pd.notna(isbn) else ""
    if isbn_str in ol_cache:
        results.append(ol_cache[isbn_str])
    else:
        # If not in cache, return default structure
        results.append({
            "pages_openlib": None,
            "publication_date_openlib": None,
            "language_openlib": None,
            "subjects_openlib": None,
            "publisher_openlib": None,
            "description_openlib": None,
        })

Continued workflow:

In [11]:
# convert results to dataframe
ol_df = pd.DataFrame(results, index=to_impute.index)
print("API results summary:")
print(ol_df.notna().sum())

# merge back into gb_enriched
for col in ol_df.columns:
    if col not in gb_enriched.columns:
        gb_enriched[col] = None
    gb_enriched.loc[ol_df.index, col] = ol_df[col]

# verify the merge
print("\nAfter merge:")
print(gb_enriched[ol_df.columns].notna().sum())

API results summary:
pages_openlib               1370
publication_date_openlib    1708
language_openlib            1445
subjects_openlib            1027
publisher_openlib           1660
description_openlib          523
dtype: int64

After merge:
pages_openlib               1370
publication_date_openlib    1708
language_openlib            1445
subjects_openlib            1027
publisher_openlib           1660
description_openlib          523
dtype: int64


In [12]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/enriched")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 2

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v2 saved successfully in data/interim/merge directory.


#### Cleaning and Processing OpenLibrary Data

We apply the same cleaning steps used in Notebook 02, compiled into a pipeline, to standardize OpenLibrary API responses. The `apply_cleaners_selectively()` function ensures consistent data types, formats, and validation across all metadata fields. After cleaning, we fill missing values in `gb_enriched` using the cleaned OpenLibrary data.

For genre enrichment, we map OpenLibrary subjects to our standardized genre taxonomy using `map_subjects_to_genres()`. This populates `genres_simplified` for books that had subjects but no genre data, significantly improving genre coverage. The enriched dataset is saved as **version 3**.

In [13]:
# clean OpenLibrary API data
gb_enriched = apply_cleaners_selectively(
    gb_enriched,
    fields_to_clean=[
        'pages',
        'publication_date',
        'language',
        'subjects',
        'publisher',
        'description'
        ],
    source_suffix='_openlib',
    target_suffix='_openlib_clean',
    inplace=False
)

# verify cleaning
print("\nSample of cleaned OpenLibrary data:")
gb_enriched[[
    'title_clean',
    'pages_clean',
    'pages_openlib',
    'pages_openlib_clean',
    'publication_date_clean',
    'publication_date_openlib',
    'publication_date_openlib_clean',
    'language_clean',
    'language_openlib',
    'language_openlib_clean',
    'genres_clean',
    'genres_simplified',
    'subjects_openlib',
    'subjects_openlib_clean',
    'publisher_clean',
    'description_openlib',
    'description_clean',
    'description_openlib',
    'description_openlib_clean'
    ]].sample(15, random_state=42)


Sample of cleaned OpenLibrary data:


Unnamed: 0,title_clean,pages_clean,pages_openlib,pages_openlib_clean,publication_date_clean,publication_date_openlib,publication_date_openlib_clean,language_clean,language_openlib,language_openlib_clean,genres_clean,genres_simplified,subjects_openlib,subjects_openlib_clean,publisher_clean,description_openlib,description_clean,description_openlib.1,description_openlib_clean
6252,scion of ikshvaku,354.0,,,2015-01-01,,,en,,,"['mythology', 'fiction', 'fantasy', 'indian li...","['mythology', 'fiction', 'fantasy', 'other', '...",,,westland publication,,ram rajya the perfect land but perfection has ...,,
4684,canada,420.0,,,2012-01-01,,,en,,,"['fiction', 'canada', 'literary fiction', 'con...","['fiction', 'other', 'literary fiction', 'cont...",,,harpercollins,,first i'll tell about the robbery our parents ...,,
1731,the man in the brown suit,381.0,,,1924-01-01,,,en,,,"['mystery', 'fiction', 'crime', 'classics', 'm...","['mystery', 'fiction', 'crime', 'classics', 'm...",,,harpercollins,,newly-orphaned anne beddingfeld is a nice engl...,,
4742,twilight and philosophy vampires vegetarians a...,259.0,,,2009-01-01,,,en,,,"['philosophy', 'nonfiction', 'vampires', 'essa...","['philosophy', 'nonfiction', 'vampires', 'essa...",,,wiley,,the first look at the philosophy behind stephe...,,
4521,saga vol 5,,152.0,152.0,2015-01-01,"September 15, 2015",2015-09-15,en,eng,en,,,"[military deserters, parents of exceptional ch...","[military deserters, parents of exceptional ch...",,,,,
6340,asterix the gaul,48.0,,,1960-01-01,,,en,,,"['comics', 'graphic novels', 'bande dessine', ...","['comics', 'graphic novels', 'other', 'fiction...",,,"orion books ltd, london",,the year is 50 bc and all gaul is occupied onl...,,
576,tuck everlasting,148.0,,,1975-01-01,,,en,,,"['fantasy', 'young adult', 'classics', 'fictio...","['fantasy', 'young adult', 'classics', 'fictio...",,,macmillan,,doomed to - or blessed with - eternal life aft...,,
5202,domes of fire,,480.0,480.0,1992-01-01,"May 29, 1993",1993-05-29,,eng,en,,,"[fiction - fantasy, fiction, fantasy, fantasy ...","[fiction - fantasy, fiction, fantasy, fantasy ...",,,,,
6363,when we were orphans,,320.0,320.0,2000-01-01,"March 3, 2005",2005-03-03,en,,,,,"[modern fiction, fiction]","[modern fiction, fiction]",,,,,
439,fall of giants,985.0,,,2010-01-01,,,en,,,"['historical fiction', 'fiction', 'historical'...","['historical fiction', 'fiction', 'historical'...",,,penguin random house,,this is an epic of love hatred war and revolut...,,


In [14]:
# fill missing values with cleaned OpenLibrary data
print("\n--- Filling missing values with cleaned OpenLibrary data ---")

# fill pages_clean
before_pages = gb_enriched['pages_clean'].isna().sum()
gb_enriched['pages_clean'] = gb_enriched['pages_clean'].fillna(gb_enriched['pages_openlib_clean'])
after_pages = gb_enriched['pages_clean'].isna().sum()
print(f"pages_clean: filled {before_pages - after_pages} values")

# fill publication_date_clean
before_date = gb_enriched['publication_date_clean'].isna().sum()
gb_enriched['publication_date_clean'] = gb_enriched['publication_date_clean'].fillna(gb_enriched['publication_date_openlib_clean'])
after_date = gb_enriched['publication_date_clean'].isna().sum()
print(f"publication_date_clean: filled {before_date - after_date} values")

# fill language_clean
# Create mask that catches both NaN and invalid string values
before_lang = (gb_enriched['language_clean'].isna() | 
               gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

mask = (gb_enriched['language_clean'].isna() | 
        gb_enriched['language_clean'].isin(['unknown', '', 'None']))

gb_enriched.loc[mask, 'language_clean'] = gb_enriched.loc[mask, 'language_openlib_clean']

after_lang = (gb_enriched['language_clean'].isna() | 
              gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

print(f"language_clean: filled {before_lang - after_lang} values")

# fill publisher_clean
print("\n--- Filling missing publisher_clean using OpenLibrary data ---")
before_publisher = gb_enriched['publisher_clean'].isna().sum()
gb_enriched['publisher_clean'] = gb_enriched['publisher_clean'].fillna(
    gb_enriched['publisher_openlib_clean']
)
after_publisher = gb_enriched['publisher_clean'].isna().sum()
print(f"publisher_clean: filled {before_publisher - after_publisher} values")

# fill description_clean
print("\n--- Filling missing description_clean using OpenLibrary data ---")

before_desc = gb_enriched['description_clean'].isna().sum()
gb_enriched['description_clean'] = gb_enriched['description_clean'].fillna(
    gb_enriched['description_openlib_clean']
)
after_desc = gb_enriched['description_clean'].isna().sum()
print(f"description_clean: filled {before_desc - after_desc} values")

# generate genres_simplified from subjects_openlib_clean
print("\n--- Generating genres_simplified from OpenLibrary subjects ---")

# Import genre mapping utilities

# Fill genres_simplified for books that have subjects but no genres
books_needing_genre_mapping = (
    (gb_enriched['genres_simplified'].isna()) & 
    (gb_enriched['subjects_openlib_clean'].notna())
)

print(f"Books with subjects but no genres_simplified: {books_needing_genre_mapping.sum()}")

if books_needing_genre_mapping.sum() > 0:
    # Apply genre mapping to subjects
    gb_enriched.loc[books_needing_genre_mapping, 'genres_simplified'] = (
        gb_enriched.loc[books_needing_genre_mapping, 'subjects_openlib_clean']
        .apply(lambda x: map_subjects_to_genres(x) if isinstance(x, list) else None)
    )
    
    filled_genres = (
        gb_enriched.loc[books_needing_genre_mapping, 'genres_simplified'].notna().sum()
    )
    print(f"genres_simplified: mapped {filled_genres} values from OpenLibrary subjects")
    
    # Show sample of newly mapped genres
    newly_mapped = gb_enriched[books_needing_genre_mapping & gb_enriched['genres_simplified'].notna()]
    if len(newly_mapped) > 0:
        print("\nSample of newly mapped genres:")
        print(newly_mapped[['title_clean', 'subjects_openlib_clean', 'genres_simplified']].head(5))



--- Filling missing values with cleaned OpenLibrary data ---
pages_clean: filled 1257 values
publication_date_clean: filled 9 values
language_clean: filled 305 values

--- Filling missing publisher_clean using OpenLibrary data ---
publisher_clean: filled 1515 values

--- Filling missing description_clean using OpenLibrary data ---
description_clean: filled 509 values

--- Generating genres_simplified from OpenLibrary subjects ---
Books with subjects but no genres_simplified: 951
genres_simplified: mapped 796 values from OpenLibrary subjects

Sample of newly mapped genres:
            title_clean                             subjects_openlib_clean  \
29            gone girl  [fiction suspense, fiction mystery detective g...   
32  memoirs of a geisha  [geishas -- fiction, women -- japan -- fiction...   
43         the notebook                          [modern fiction, fiction]   
47       fahrenheit 451  [bradbury ray - prose criticism, spanishcontem...   
70         frankenstein  [fran

In [15]:
print("\n--- ENRICHMENT SUMMARY ---")

# Show books that received OpenLibrary data
books_with_ol_data = gb_enriched[
    gb_enriched['subjects_openlib_clean'].notna()
].copy()

print(f"\nTotal books enriched with OpenLibrary data: {len(books_with_ol_data)}")

if len(books_with_ol_data) > 0:
    print("\nSample of books enriched from OpenLibrary:")
    display(books_with_ol_data[[
        'title_clean',
        'author_clean',
        'pages_openlib_clean',
        'publication_date_openlib_clean',
        'language_openlib_clean',
        'subjects_openlib_clean',
        'genres_simplified',
        'publisher_clean',
        'description_clean'
    ]].head(10))

# Show overall genre coverage
print(f"\n--- GENRE COVERAGE AFTER ENRICHMENT ---")
print(f"Books with genres_clean: {gb_enriched['genres_clean'].notna().sum()}")
print(f"Books without genres_clean: {gb_enriched['genres_clean'].isna().sum()}")
print(f"Genre coverage: {gb_enriched['genres_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%\n")
print(f"Books with genres_simplified: {gb_enriched['genres_simplified'].notna().sum()}")
print(f"Books without genres_simplified: {gb_enriched['genres_simplified'].isna().sum()}")
print(f"Genre simplified coverage: {gb_enriched['genres_simplified'].notna().sum() / len(gb_enriched) * 100:.1f}%")
# Show overall description and publisher coverage
print(f"Books with description_clean: {gb_enriched['description_clean'].notna().sum()}")
print(f"Books without description_clean: {gb_enriched['description_clean'].isna().sum()}")
print(f"Books description coverage: {gb_enriched['description_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%")
print(f"Books with publisher_clean: {gb_enriched['publisher_clean'].notna().sum()}")
print(f"Books without publisher_clean: {gb_enriched['publisher_clean'].isna().sum()}")
print(f"Books publisher coverage: {gb_enriched['publisher_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%")


--- ENRICHMENT SUMMARY ---

Total books enriched with OpenLibrary data: 1027

Sample of books enriched from OpenLibrary:


Unnamed: 0,title_clean,author_clean,pages_openlib_clean,publication_date_openlib_clean,language_openlib_clean,subjects_openlib_clean,genres_simplified,publisher_clean,description_clean
29,gone girl,gillian flynn,399.0,2012-01-01,en,"[fiction suspense, fiction mystery detective g...","[fiction, mystery]",weidenfeld nicolson,just how well can you ever know the person you...
32,memoirs of a geisha,arthur golden,758.0,2005-01-01,en,"[geishas -- fiction, women -- japan -- fiction...","[fiction, historical fiction]",penguin random house,
43,the notebook,nicholas sparks,272.0,2004-07-05,,"[modern fiction, fiction]",[fiction],penguin random house,
47,fahrenheit 451,ray bradbury,176.0,2006-01-03,es,"[bradbury ray - prose criticism, spanishcontem...","[fiction, science fiction, non-fiction]",plaza y janes,
70,frankenstein,"mary wollstonecraft shelley, percy bysshe shel...",273.0,2003-01-01,en,"[frankenstein-- fiction, scientists -- fiction...",[fiction],penguin random house,presents the story of dr frankenstein and his ...
83,jurassic park,michael crichton,467.0,2006-01-01,es,"[suspense, fiction, fiction - general, spanish...",[fiction],debolsillo,
146,thirteen reasons why,jay asher,288.0,2008-01-01,en,"[suicide -- fiction, high schools -- fiction, ...",[fiction],razorbill,when high school student clay jenkins receives...
166,american gods,neil gaiman,672.0,2002-03-04,en,[science fiction],"[fiction, science fiction]",headline book publishing,
173,the shack,william paul young,252.0,2007-01-01,en,"[life change events -- fiction, missing childr...","[fiction, children]",windblown media,mackenzie allen phillips' youngest daughter mi...
194,the guernsey literary and potato peel pie society,"mary ann shaffer, annie barrows",288.0,2008-01-01,en,"[literary, fiction literary, fiction, fiction ...",[fiction],the dial press,i wonder how the book got to guernsey perhaps ...



--- GENRE COVERAGE AFTER ENRICHMENT ---
Books with genres_clean: 8082
Books without genres_clean: 1918
Genre coverage: 80.8%

Books with genres_simplified: 8878
Books without genres_simplified: 1122
Genre simplified coverage: 88.8%
Books with description_clean: 8518
Books without description_clean: 1482
Books description coverage: 85.2%
Books with publisher_clean: 9469
Books without publisher_clean: 531
Books publisher coverage: 94.7%


In [16]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/enriched")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 3

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v3 saved successfully in data/interim/merge directory.


#### Querying Google Books API (with Quota Management)

After OpenLibrary enrichment, we create a new mask to identify remaining gaps: **1,730** books. Google Books API requires an API key and has daily quota limits (1,000 requests/day for free tier), so we implement several strategies: **(1)** process ISBNs in chunks of 1,000, **(2)** add sleep delays between requests, **(3)** cache all results to avoid re-querying, and **(4)** save progress incrementally.

We load existing cache if available, query only uncached ISBNs, and update the cache after each session. This approach allows us to spread queries across multiple days if needed while preserving all previous work.

In [17]:
# Check how many books still need enrichment
new_missing_mask = (
    gb_enriched['language_clean'].isna() |
    gb_enriched['language_clean'].isin(['unknown', '', 'None']) |
    gb_enriched['pages_clean'].isna() |
    gb_enriched['publication_date_clean'].isna() | 
    gb_enriched['publisher_clean'].isna() |
    gb_enriched['description_clean'].isna()
)

new_to_impute = gb_enriched[new_missing_mask].copy()
print("Books still needing external enrichment:", len(new_to_impute))

# Show breakdown by field
print("\nBreakdown of remaining missing values:")
print(f"  - Missing pages: {gb_enriched['pages_clean'].isna().sum()}")
print(f"  - Missing publication_date: {gb_enriched['publication_date_clean'].isna().sum()}")
print(f"  - Missing/invalid language: {(gb_enriched['language_clean'].isna() | gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()}")
print(f"  - Missing publisher: {gb_enriched['publisher_clean'].isna().sum()}")
print(f"  - Missing description: {gb_enriched['description_clean'].isna().sum()}")  

Books still needing external enrichment: 1728

Breakdown of remaining missing values:
  - Missing pages: 690
  - Missing publication_date: 2
  - Missing/invalid language: 95
  - Missing publisher: 531
  - Missing description: 1482


In [None]:
def chunk_list(data, size=1000):
    for i in range(0, len(data), size):
        yield data[i:i+size]

In [18]:
import json
from pathlib import Path

# Define cache path for Google Books in data/raw
CACHE_PATH = Path("data/raw/google_api_cache.json")

# Create directory if it doesn't exist
CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)

# Load existing cache if it exists
if CACHE_PATH.exists():
    with open(CACHE_PATH, "r") as f:
        google_cache = json.load(f)
        print(f"Loaded {len(google_cache)} cached Google Books entries")
else:
    google_cache = {}
    print("No existing cache found, starting fresh")

Loaded 1728 cached Google Books entries


In [None]:
import os
import time
import requests
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_BOOKS_API_KEY")

def query_google_books(isbn):
    isbn = str(isbn)

    # Check cache
    if isbn in google_cache:
        return google_cache[isbn]

    # Query Google Books with API key
    url = (
        f"https://www.googleapis.com/books/v1/volumes?"
        f"q=isbn:{isbn}&key={GOOGLE_API_KEY}"
    )

    r = requests.get(url)

    if r.status_code != 200:
        result = {"isbn": isbn, "error": f"HTTP {r.status_code}"}
    else:
        data = r.json()
        if "items" in data and data["items"]:
            volume = data["items"][0]["volumeInfo"]
            result = {
                "isbn": isbn,
                "title": volume.get("title"),
                "authors": volume.get("authors"),
                "publisher": volume.get("publisher"),
                "publishedDate": volume.get("publishedDate"),
                "pageCount": volume.get("pageCount"),
                "categories": volume.get("categories"),
                "language": volume.get("language"),
                "description": volume.get("description"),
            }
        else:
            result = {"isbn": isbn, "error": "No results"}

    # Save to cache
    google_cache[isbn] = result
    return result

In [None]:
list_of_isbns = (
    new_to_impute['isbn_query']
    .dropna()
    .astype(str)
    .str.strip()
    .unique()
    .tolist()
)
print(f"Unique ISBNs to query: {len(list_of_isbns)}")
chunks = list(chunk_list(list_of_isbns, size=1000))

In [None]:
from tqdm import tqdm

# Choose which chunk you want to process today
chunk_to_process = chunks[1]   # run chunk 0 today, 1 tomorrow, etc.
results = []
for isbn in tqdm(chunk_to_process, desc="Querying Google Books"):
    results.append(query_google_books(isbn))
    time.sleep(0.1)   # be nice to the API

In [None]:
# Save cache after ISBN queries
with open(CACHE_PATH, "w") as f:
    json.dump(google_cache, f, indent=2)
print(f"Cache updated with {len(google_cache)} entries after ISBN queries")

#### Handling Books Without ISBNs

Some books lack valid ISBNs but can still be enriched using **title and author search**. Google Books API supports `intitle:` and `inauthor:` query parameters, allowing us to find books by bibliographic metadata instead of identifiers. We create cache keys in `"title|author"` format to distinguish these from ISBN-based queries.

This fallback strategy significantly increases our enrichment coverage, especially for older books, special editions, or records with ISBN errors. Results are cached alongside ISBN queries to maintain a unified enrichment workflow.

In [None]:

def query_google_books_by_title(title, author):
    """Query Google Books API using title and author when ISBN is unavailable."""
    
    # Create cache key
    cache_key = f"{title}|{author}"
    
    if cache_key in google_cache:
        return google_cache[cache_key]
    
    # Build query string
    query_parts = []
    if pd.notna(title):
        query_parts.append(f'intitle:"{title}"')
    if pd.notna(author):
        query_parts.append(f'inauthor:"{author}"')
    
    query_string = "+".join(query_parts)
    
    url = (
        f"https://www.googleapis.com/books/v1/volumes?"
        f"q={query_string}&key={GOOGLE_API_KEY}"
    )
    
    r = requests.get(url)
    
    if r.status_code != 200:
        result = {"title": title, "author": author, "error": f"HTTP {r.status_code}"}
    else:
        data = r.json()
        if "items" in data and data["items"]:
            volume = data["items"][0]["volumeInfo"]
            result = {
                "title": title,
                "author": author,
                "pageCount": volume.get("pageCount"),
                "publisher": volume.get("publisher"),  
                "publishedDate": volume.get("publishedDate"),
                "categories": volume.get("categories"),
                "language": volume.get("language"),
                "description": volume.get("description"),
            }
        else:
            result = {"title": title, "author": author, "error": "No results"}
    
    google_cache[cache_key] = result
    return result

# Process books without ISBN separately
books_without_isbn = new_to_impute[new_to_impute['isbn_query'].isna()].copy()
print(f"Books without ISBN to query by title/author: {len(books_without_isbn)}")

results_by_title = []
for idx, row in tqdm(books_without_isbn.iterrows(), 
                     total=len(books_without_isbn),
                     desc="Querying Google Books by title/author"):
    results_by_title.append(query_google_books_by_title(
        row['title_clean'], 
        row['author_clean']
    ))
    time.sleep(0.1)

In [None]:
# Save cache after title/author queries
with open(CACHE_PATH, "w") as f:
    json.dump(google_cache, f, indent=2)
print(f"Cache updated with {len(google_cache)} entries after title/author queries")

#### Loading and Applying Cached Results

The Google Books cache contains results from multiple query sessions, potentially across different days. We load the complete cache and separate ISBN-based results from title/author-based results by checking for the `"|"` delimiter in cache keys. This allows us to apply different matching logic for each result type.

We then merge cached data back into `gb_enriched`, apply the cleaning pipeline to standardize formats, and fill remaining metadata gaps. The `map_subjects_to_genres()` function maps Google Books categories to our genre taxonomy, further increasing genre coverage. This completes our multi-source enrichment strategy.

In [19]:
import json
import pandas as pd
from pathlib import Path

# Load the cache
CACHE_PATH = Path("data/raw/google_api_cache.json")

with open(CACHE_PATH, "r") as f:
    google_cache = json.load(f)

print(f"Loaded {len(google_cache)} cached entries")

# Separate ISBN-based results from title/author-based results
isbn_results = []
title_author_results = []

for key, value in google_cache.items():
    if "|" in key:  # Title|Author format
        title_author_results.append(value)
    else:  # ISBN format
        isbn_results.append(value)

print(f"Found {len(isbn_results)} ISBN-based results")
print(f"Found {len(title_author_results)} title/author-based results")

# add google books data to dataframe
google_columns = [
    'pageCount_google',
    'publishedDate_google',
    'categories_google',
    'language_google',
    'publisher_google',
    'description_google'
]
for col in google_columns:
    if col not in gb_enriched.columns:
        gb_enriched[col] = None

Loaded 1728 cached entries
Found 1276 ISBN-based results
Found 452 title/author-based results


In [20]:
# merge isbn_results back to gb_enriched
if isbn_results:
    google_isbn_df = pd.DataFrame(isbn_results)
    print("\nISBN results preview:")
    print(google_isbn_df.head())
    
    # Create ISBN mapping
    isbn_to_data = {str(row['isbn']): row for _, row in google_isbn_df.iterrows() 
                    if 'isbn' in row and pd.notna(row.get('isbn'))}
    
    # Update gb_enriched
    books_with_isbn = new_to_impute[new_to_impute['isbn_query'].notna()].copy()
    
    for idx in books_with_isbn.index:
        isbn = str(books_with_isbn.loc[idx, 'isbn_query'])
        if isbn in isbn_to_data:
            result = isbn_to_data[isbn]
            if pd.isna(result.get('error')):
                gb_enriched.loc[idx, 'pageCount_google'] = result.get('pageCount')
                gb_enriched.loc[idx, 'publishedDate_google'] = result.get('publishedDate')
                gb_enriched.loc[idx, 'categories_google'] = result.get('categories')
                gb_enriched.loc[idx, 'language_google'] = result.get('language')
                gb_enriched.loc[idx, 'publisher_google'] = result.get('publisher')
                gb_enriched.loc[idx, 'description_google'] = result.get('description')
    
    print(f"Merged ISBN-based results for {len(isbn_to_data)} books")


# merge title/author results back to gb_enriched

if title_author_results:
    google_title_df = pd.DataFrame(title_author_results)
    print("\nTitle/Author results preview:")
    print(google_title_df.head())
    
    # Recreate books_without_isbn from new_to_impute
    books_without_isbn = new_to_impute[new_to_impute['isbn_query'].isna()].copy()
    
    # Create title|author key mapping
    for i, (idx, row) in enumerate(books_without_isbn.iterrows()):
        if i < len(google_title_df):
            title_author_key = f"{row['title_clean']}|{row['author_clean']}"
            if title_author_key in google_cache:
                result = google_cache[title_author_key]
                if pd.isna(result.get('error')):
                    gb_enriched.loc[idx, 'pageCount_google'] = result.get('pageCount')
                    gb_enriched.loc[idx, 'publishedDate_google'] = result.get('publishedDate')
                    gb_enriched.loc[idx, 'categories_google'] = result.get('categories')
                    gb_enriched.loc[idx, 'language_google'] = result.get('language')
                    gb_enriched.loc[idx, 'publisher_google'] = result.get('publisher')
                    gb_enriched.loc[idx, 'description_google'] = result.get('description')
    
    print(f"Merged title/author-based results for {len(books_without_isbn)} books")

# Verify merge
print("\nGoogle Books data merged:")
for col in google_columns:
    count = gb_enriched[col].notna().sum()
    print(f"  - {col}: {count} values")

# clean Google Books API data
from src.cleaning.utils.pipeline import apply_cleaners_selectively

gb_enriched = apply_cleaners_selectively(
    gb_enriched,
    fields_to_clean=[
        'pageCount',
        'publishedDate',
        'language',
        'categories',
        'publisher',
        'description'
        ],
    source_suffix='_google',
    target_suffix='_google_clean',
    inplace=False
)

# Verify cleaning
print("\nSample of cleaned Google Books data:")
display(gb_enriched[[
    'title_clean',
    'pages_clean',
    'pageCount_google',
    'pageCount_google_clean',
    'publication_date_clean',
    'publishedDate_google',
    'publishedDate_google_clean',
    'language_clean',
    'language_google',
    'language_google_clean',
    'genres_clean',
    'categories_google',
    'categories_google_clean',
    'genres_simplified',
    'publisher_google',
    'publisher_clean',
    'description_google',
    'description_clean',
]].dropna(subset=['pageCount_google_clean', 'language_google_clean'], how='all').sample(min(15, len(gb_enriched)), random_state=42))


ISBN results preview:
         isbn                       title             authors  \
0  0452284244                 Animal Farm     [George Orwell]   
1  0618346252  The Fellowship of the Ring  [J. R. R. Tolkien]   
2  0739326228         Memoirs of a Geisha     [Arthur Golden]   
3  0965818675                         NaN                 NaN   
4  0553816713                         NaN                 NaN   

                             publisher publishedDate  pageCount categories  \
0                              Penguin    2003-05-06      129.0  [Fiction]   
1                        Mariner Books       2003-09      398.0  [Fiction]   
2  Random House Large Print Publishing          2005      758.0  [Fiction]   
3                                  NaN           NaN        NaN        NaN   
4                                  NaN           NaN        NaN        NaN   

  language                                        description     error  
0       en  75th Anniversary Edition—Includ

Unnamed: 0,title_clean,pages_clean,pageCount_google,pageCount_google_clean,publication_date_clean,publishedDate_google,publishedDate_google_clean,language_clean,language_google,language_google_clean,genres_clean,categories_google,categories_google_clean,genres_simplified,publisher_google,publisher_clean,description_google,description_clean
3938,history of art,1000.0,1000.0,1000.0,1962-01-01,1997,1997-01-01,en,en,en,,[Art],[art],[historical fiction],,thames and hudson,The fifth edition of this work is revised by t...,
1844,alias grace,636.0,636.0,636.0,1996-01-01,1998,1998-01-01,de,de,,,,,,,btb bei goldmann,,
9338,the protector,322.0,370.0,370.0,2001-01-01,2005-10,2005-10-01,en,en,en,,[Fiction],[fiction],"[fiction, romance]","Tyndale House Publishers, Inc.",tyndale house publishers,C.1 ST. AID B & T. 07-25-2007. $13.99.,
6471,the hidden city,512.0,514.0,514.0,1994-01-01,1995-08-01,1995-08-01,en,en,en,,[Fiction],[fiction],"[fiction, fantasy]",Del Rey,del rey,Sparhawk’s epic quest comes to a riveting conc...,
7502,what katy did,136.0,136.0,136.0,1872-01-01,1999-01-01,1999-01-01,en,en,en,,,,"[fiction, young adult, children]",,adamant media corporation,This Elibron Classics title is a reprint of th...,
7222,what the night knows,,442.0,442.0,2010-01-01,2010,2010-01-01,en,en,en,,[Fiction],[fiction],,Bantam Dell Publishing Group,penguin random house,A companion to The Darkest Evening of the Year...,
7205,stars of fortune,314.0,337.0,337.0,2015-01-01,2015-11-03,2015-11-03,en,en,en,,[Fiction],[fiction],[fiction],Berkley,,"Includes excerpt from author's, ""The Obsession...",to celebrate the rise of their new queen three...
3265,naruto -ナルト- 巻ノ四十三,248.0,248.0,248.0,2008-01-01,2008-08,2008-08-01,ja,ja,,"['fantasy', 'comics', 'graphic novels', 'anime...","[Comic books, strips, etc]",[comic books strips etc],"['fantasy', 'comics', 'graphic novels', 'other...",Shueisha/Tsai Fong Books,shueisha,This series has won the highest rating both as...,
4956,shadow spell,339.0,354.0,354.0,2014-01-01,2014-03-25,2014-03-25,en,en,en,,[Fiction],[fiction],"[fiction, fantasy, romance]",Penguin,,From #1 New York Times bestselling author Nora...,with the legends and lore of ireland running t...
6318,shattered,304.0,308.0,308.0,1973-01-01,1986-11-15,1986-11-15,en,en,en,,[Fiction],[fiction],,Penguin,berkley,"Getting there is supposed to be half the fun, ...",


In [21]:
# after cleaning Google Books data, we'll add genre mapping
print("\n--- Generating genres_simplified from Google Books categories ---")

books_needing_google_genre_mapping = (
    (gb_enriched['genres_simplified'].isna()) & 
    (gb_enriched['categories_google_clean'].notna())
)

print(f"Books with Google categories but no genres_simplified: {books_needing_google_genre_mapping.sum()}")

if books_needing_google_genre_mapping.sum() > 0:
    gb_enriched.loc[books_needing_google_genre_mapping, 'genres_simplified'] = (
        gb_enriched.loc[books_needing_google_genre_mapping, 'categories_google_clean']
        .apply(lambda x: map_subjects_to_genres(x) if isinstance(x, list) else None)
    )
    
    filled_genres = (
        gb_enriched.loc[books_needing_google_genre_mapping, 'genres_simplified'].notna().sum()
    )
    print(f"genres_simplified: mapped {filled_genres} values from Google Books categories")


# fill missing values with cleaned Google Books data
print("\n--- Filling missing values with cleaned Google Books data ---")

# Fill pages_clean
print("\n--- Filling remaining page_clean using Google Books data ---")
before_pages = gb_enriched['pages_clean'].isna().sum()
gb_enriched['pages_clean'] = gb_enriched['pages_clean'].fillna(gb_enriched['pageCount_google_clean'])
after_pages = gb_enriched['pages_clean'].isna().sum()
print(f"pages_clean: filled {before_pages - after_pages} values from Google Books")

# Fill publication_date_clean
print("\n--- Filling remaining publication_date_clean using Google Books data ---")
before_date = gb_enriched['publication_date_clean'].isna().sum()
gb_enriched['publication_date_clean'] = gb_enriched['publication_date_clean'].fillna(gb_enriched['publishedDate_google_clean'])
after_date = gb_enriched['publication_date_clean'].isna().sum()
print(f"publication_date_clean: filled {before_date - after_date} values from Google Books")

# Fill language_clean
print("\n--- Filling remaining language_clean using Google Books data ---")
before_lang = (gb_enriched['language_clean'].isna() | 
               gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()
mask = (gb_enriched['language_clean'].isna() | 
        gb_enriched['language_clean'].isin(['unknown', '', 'None']))
gb_enriched.loc[mask, 'language_clean'] = gb_enriched.loc[mask, 'language_google_clean']
after_lang = (gb_enriched['language_clean'].isna() | 
              gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

print(f"language_clean: filled {before_lang - after_lang} values from Google Books")

# Fill publisher_clean
print("\n--- Filling remaining publisher_clean using Google Books data ---")
before_publisher = gb_enriched['publisher_clean'].isna().sum()
gb_enriched['publisher_clean'] = gb_enriched['publisher_clean'].fillna(
    gb_enriched['publisher_google_clean']
)
after_publisher = gb_enriched['publisher_clean'].isna().sum()
print(f"publisher_clean: filled {before_publisher - after_publisher} values from Google Books")


# Fill description_clean
print("\n--- Filling remaining description_clean using Google Books data ---")
before_desc_google = gb_enriched['description_clean'].isna().sum()
gb_enriched['description_clean'] = gb_enriched['description_clean'].fillna(
    gb_enriched['description_google_clean']
)
after_desc_google = gb_enriched['description_clean'].isna().sum()

print(f"description_clean (Google): filled {before_desc_google - after_desc_google} values")

# final enrichment summary
print("\n--- FINAL ENRICHMENT SUMMARY (ALL SOURCES) ---")

print(f"\nTotal books enriched with Google Books data: {gb_enriched[gb_enriched['categories_google_clean'].notna()].shape[0]}")

print("\n--- FINAL METADATA COVERAGE ---")
print(f"Books with pages_clean: {gb_enriched['pages_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['pages_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with publication_date_clean: {gb_enriched['publication_date_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['publication_date_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with publisher_clean: {gb_enriched['publisher_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['publisher_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with description_clean: {gb_enriched['description_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['description_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
valid_language = gb_enriched['language_clean'].notna() & ~gb_enriched['language_clean'].isin(['unknown', '', 'None'])
print(f"Books with valid language_clean: {valid_language.sum()} / {len(gb_enriched)} ({valid_language.sum() / len(gb_enriched) * 100:.1f}%)")

print(f"\n--- FINAL GENRE COVERAGE ---")
print(f"Books with genres_clean: {gb_enriched['genres_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['genres_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with genres_simplified: {gb_enriched['genres_simplified'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['genres_simplified'].notna().sum() / len(gb_enriched) * 100:.1f}%)")


--- Generating genres_simplified from Google Books categories ---
Books with Google categories but no genres_simplified: 583
genres_simplified: mapped 376 values from Google Books categories

--- Filling missing values with cleaned Google Books data ---

--- Filling remaining page_clean using Google Books data ---
pages_clean: filled 252 values from Google Books

--- Filling remaining publication_date_clean using Google Books data ---
publication_date_clean: filled 0 values from Google Books

--- Filling remaining language_clean using Google Books data ---
language_clean: filled 51 values from Google Books

--- Filling remaining publisher_clean using Google Books data ---
publisher_clean: filled 139 values from Google Books

--- Filling remaining description_clean using Google Books data ---
description_clean (Google): filled 992 values

--- FINAL ENRICHMENT SUMMARY (ALL SOURCES) ---

Total books enriched with Google Books data: 1143

--- FINAL METADATA COVERAGE ---
Books with pages_c

In [22]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/enriched")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 4

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v4 saved successfully in data/interim/merge directory.


Our multi-source enrichment strategy (BBE → OpenLibrary → Google Books) achieved excellent metadata coverage: **95.1%** for page counts, **100%** for publication dates, **99.4%** for valid language codes, **94.7%** publishers and **85.2%** descriptions. Genre coverage reached **80.8%** for `genres_clean` and **90.7%** for `genres_simplified`, a significant improvement from the original Goodbooks dataset which lacked genre information entirely.

This enriched dataset now provides a comprehensive foundation for modeling and analysis. The combination of catalog metadata from BBE, behavioral data from Goodbooks ratings, and API-sourced supplemental information creates a unified dataset that supports both predictive modeling and catalog diversity analysis. The next step is filtering to English-language titles and preparing the final model-ready dataset.

#### Final Metadata Enrichment Steps

After completing API-based enrichment, we perform three final metadata enhancement steps to ensure completeness and consistency across all enriched records:

1. **Major Publisher Classification**: We apply the `is_major_publisher` flag to books that received publisher data from APIs but weren't present in the BBE dataset. Using the same publisher pattern matching from Notebook 02, we classify publishers against our curated list of major publishing houses.

2. **Awards Flag Completion**: We fill missing `has_award` values with `False` for all books that weren't in the BBE dataset (which contains award metadata). Since API sources don't reliably provide award information, we assume absence of award data means no awards.

3. **NLP-Ready Description Generation**: For books enriched with API descriptions, we apply the same NLP cleaning pipeline used in Notebook 02. This converts descriptions into analysis-ready text by removing HTML tags, normalizing whitespace, and standardizing punctuatio, ensuring consistency across BBE and API-sourced descriptions for future text analysis tasks.

These steps ensure that all enriched books have the same metadata structure and quality as the original BBE dataset, maintaining consistency across the entire unified dataset.

In [23]:
import json

# load publisher patterns from JSON file
with open("src/cleaning/mappings/publisher_parent_mapping.json", "r", encoding="utf-8") as f:
    major_publishers = json.load(f)

mask_missing_major = gb_enriched['is_major_publisher'].isna()

gb_enriched.loc[mask_missing_major, 'is_major_publisher'] = (
    gb_enriched.loc[mask_missing_major, 'publisher_clean']
        .str.lower()
        .apply(lambda x: any(mp in x for mp in major_publishers) if isinstance(x, str) else False)
)
print(f"Books without is_major_publisher flag: {gb_enriched['is_major_publisher'].isna().sum()}")

Books without is_major_publisher flag: 0


In [24]:
print(f"Books without has_awards flag: {gb_enriched['has_award'].isna().sum()}")
gb_enriched['has_award'] = (
    gb_enriched['has_award']
    .fillna(False)
    .astype('bool')
)
print(f"Remaining books without has_awards flag: {gb_enriched['has_award'].isna().sum()}")


Books without has_awards flag: 1918
Remaining books without has_awards flag: 0


In [25]:
from src.cleaning.utils.text_cleaning import (
    clean_description_nlp
)
# generate description_nlp from description_clean where API descriptions exist
mask_api_desc = (
    gb_enriched['description_openlib'].notna() |
    gb_enriched['description_google'].notna()
)

gb_enriched.loc[mask_api_desc, 'description_nlp'] = (
    gb_enriched.loc[mask_api_desc, 'description_clean']
        .apply(clean_description_nlp)
)

In [26]:
import numpy as np

def fix_missing_text(col):
    return (
        gb_enriched[col]
        .replace(["", " ", "None", "none", "nan", "Nan", "NAN"], np.nan)
        .replace(r"^\s+$", np.nan, regex=True)
    )

gb_enriched['description_clean'] = fix_missing_text('description_clean')
gb_enriched['description_nlp']   = fix_missing_text('description_nlp')

print("Missing description_clean:", gb_enriched['description_clean'].isna().sum())
print("Missing description_nlp:", gb_enriched['description_nlp'].isna().sum())
gb_enriched['description_clean'] = gb_enriched['description_clean'].astype("string")
gb_enriched['description_nlp']   = gb_enriched['description_nlp'].astype("string")


Missing description_clean: 491
Missing description_nlp: 491


### Filtering for English-Language Books

To ensure consistency and focus for downstream analysis and modeling, we filter the enriched dataset to include only **English-language books**. This step is critical for:

- **Genre diversity analysis**: Comparing genre distributions across a linguistically consistent corpus
- **Ratings behavior modeling**: Ensuring user rating patterns reflect a common language context
- **Text analysis (stretch)**: Enabling NLP tasks on descriptions without multilingual complexity

We create a filtered copy of `gb_enriched` containing only books where `language_clean` is identified as English (using ISO 639 language code`'en'`). This filtered dataset will serve as the primary input for modeling and analysis, while the full enriched dataset (including non-English titles) is preserved for reference.

The English-only dataset is saved as the final output, ready for exploratory analysis and model development in subsequent notebooks.

In [27]:
gb_enriched_en = gb_enriched[gb_enriched['language_clean'] == 'en'].copy()
print("Filtered (EN only):", gb_enriched_en.shape)

Filtered (EN only): (9761, 61)


In [28]:
# drop intermediate enrichment columns
enrichment_columns_to_drop = [
    # OpenLibrary raw and intermediate columns
    'pages_openlib',
    'publication_date_openlib',
    'language_openlib',
    'subjects_openlib',
    'publisher_openlib',
    'description_openlib',
    'pages_openlib_clean',
    'publication_date_openlib_clean',
    'language_openlib_clean',
    'subjects_openlib_clean',
    'publisher_openlib_clean',
    'description_openlib_clean',
    # Google Books raw and intermediate columns
    'pageCount_google',
    'publishedDate_google',
    'categories_google',
    'language_google',
    'publisher_google',
    'description_google',
    'pageCount_google_clean',
    'publishedDate_google_clean',
    'language_google_clean',
    'categories_google_clean',
    'publisher_google_clean',
    'description_google_clean',
    # Query helper column
    'isbn_query',
    # Other intermediate columns
    'isbn13_clean',
]

# drop enrichment columns
gb_enriched_en = gb_enriched_en.drop(columns=enrichment_columns_to_drop, errors='ignore')

print(f"Columns after dropping enrichment data: {len(gb_enriched_en.columns)}")
print(f"Shape before dropping duplicates: {gb_enriched_en.shape}")

# drop duplicate rows based on goodreads_id_clean (keep first occurrence)
gb_enriched_en = gb_enriched_en.drop_duplicates(subset=['goodreads_id_clean'], keep='first')

print(f"Shape after dropping duplicates: {gb_enriched_en.shape}")
print(f"Duplicates removed: {gb_enriched_en.shape[0]}")

# Verify final columns
print("\nFinal columns for analysis:")
print(gb_enriched_en.columns.tolist())

Columns after dropping enrichment data: 35
Shape before dropping duplicates: (9761, 35)
Shape after dropping duplicates: (9761, 35)
Duplicates removed: 9761

Final columns for analysis:
['book_id', 'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'goodreads_id_clean', 'best_book_id_clean', 'work_id_clean', 'authors_list', 'author_clean', 'language_clean', 'publication_date_clean', 'isbn_clean', 'isbn_standard', 'rating_clean', 'numRatings_clean', 'numRatings_log', 'ratings_1_share', 'ratings_2_share', 'ratings_3_share', 'ratings_4_share', 'ratings_5_share', 'work_text_reviews_log', 'series_clean', 'title_clean', 'pages_clean', 'genres_clean', 'genres_simplified', 'publisher_clean', 'is_major_publisher', 'has_award', 'description_clean', 'description_nlp']


In [29]:
# ================================================
# FINAL DATA QUALITY CHECKS BEFORE SAVING
# ================================================

print("=" * 60)
print("FINAL DATA QUALITY CHECKS - gb_enriched_en")
print("=" * 60)

# check for duplicates
print("\n1. DUPLICATE CHECK")
print(f"Total rows: {len(gb_enriched_en)}")
duplicates = gb_enriched_en.duplicated(subset=['goodreads_id_clean']).sum()
print(f"Duplicate goodreads_id_clean: {duplicates}")
if duplicates > 0:
    print("WARNING: Duplicates found!")
    display(gb_enriched_en[gb_enriched_en.duplicated(subset=['goodreads_id_clean'], keep=False)])

# check for null values in critical columns
print("\n2. NULL VALUES IN CRITICAL COLUMNS")
critical_cols = ['goodreads_id_clean', 'title_clean', 'author_clean', 'language_clean']
for col in critical_cols:
    null_count = gb_enriched_en[col].isna().sum()
    print(f"{col}: {null_count} nulls ({null_count/len(gb_enriched_en)*100:.2f}%)")
    if null_count > 0:
        print(f"WARNING: Nulls found in {col}!")

# verify language filter worked
print("\n3. LANGUAGE CONSISTENCY CHECK")
unique_langs = gb_enriched_en['language_clean'].unique()
print(f"Unique languages: {unique_langs}")
if len(unique_langs) > 1 or unique_langs[0] != 'en':
    print("WARNING: Non-English books found after filtering!")

# check data types
print("\n4. DATA TYPE CHECK")
print(gb_enriched_en.dtypes)

# check for empty strings or whitespace-only values
print("\n5. EMPTY STRING CHECK")
text_cols = ['title_clean', 'author_clean', 'publisher_clean', 'description_clean']
for col in text_cols:
    if col in gb_enriched_en.columns:
        empty = (gb_enriched_en[col] == '').sum()
        whitespace = gb_enriched_en[col].str.strip().eq('').sum()
        print(f"{col}: {empty} empty strings, {whitespace} whitespace-only")

# check metadata coverage
print("\n6. METADATA COVERAGE")
metadata_cols = {
    'pages_clean': 'Pages',
    'publication_date_clean': 'Publication Date',
    'publisher_clean': 'Publisher',
    'genres_simplified': 'Genres',
    'description_clean': 'Description'
}
for col, label in metadata_cols.items():
    if col in gb_enriched_en.columns:
        coverage = gb_enriched_en[col].notna().sum() / len(gb_enriched_en) * 100
        print(f"{label}: {coverage:.1f}% coverage")

# check for unexpected enrichment columns still present
print("\n7. ENRICHMENT COLUMN CHECK")
enrichment_patterns = ['_openlib', '_google', '_bbe', 'isbn_query']
leftover_cols = [col for col in gb_enriched_en.columns 
                 if any(pattern in col for pattern in enrichment_patterns)]
if leftover_cols:
    print(f"WARNING: Leftover enrichment columns found: {leftover_cols}")
else:
    print("No enrichment columns remaining")

# aummary statistics
print("\n8. SUMMARY STATISTICS")
print(f"Final dataset shape: {gb_enriched_en.shape}")
print(f"Columns: {len(gb_enriched_en.columns)}")
print(f"Memory usage: {gb_enriched_en.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n" + "=" * 60)
print("CHECKS COMPLETE")
print("=" * 60)

FINAL DATA QUALITY CHECKS - gb_enriched_en

1. DUPLICATE CHECK
Total rows: 9761
Duplicate goodreads_id_clean: 0

2. NULL VALUES IN CRITICAL COLUMNS
goodreads_id_clean: 0 nulls (0.00%)
title_clean: 0 nulls (0.00%)
author_clean: 0 nulls (0.00%)
language_clean: 0 nulls (0.00%)

3. LANGUAGE CONSISTENCY CHECK
Unique languages: ['en']

4. DATA TYPE CHECK
book_id                      int64
work_text_reviews_count      int64
ratings_1                    int64
ratings_2                    int64
ratings_3                    int64
ratings_4                    int64
ratings_5                    int64
goodreads_id_clean          string
best_book_id_clean           int64
work_id_clean                int64
authors_list                object
author_clean                object
language_clean              object
publication_date_clean      object
isbn_clean                  string
isbn_standard               object
rating_clean               float64
numRatings_clean             int64
numRatings_log     

In [30]:
from pathlib import Path

# Create data folder if not exists
file_name = 'en_internal_catalog'
clean_path = Path("outputs/datasets/cleaned")
clean_path.mkdir(parents=True, exist_ok=True)

gb_enriched_en.to_csv(clean_path / f"{file_name}.csv", index=False)

print(f"{file_name} saved successfully in {clean_path} directory.")

en_internal_catalog saved successfully in outputs\datasets\cleaned directory.


## BBE English Only 

The BBE dataset has already been cleaned in the previous steps. Since we are not enriching it further, we only need to filter it to English-language books for consistency with the Goodbooks dataset. 

In [31]:
en_supply_catalog = bbe_clean[ bbe_clean["language_clean"] == "en" ]

# check shape after filtering
print(f"Original BBE dataset shape: {bbe_clean.shape}")
print(f"English-only BBE dataset shape: {en_supply_catalog.shape}")
print(f"Books filtered out (non-English): {bbe_clean.shape[0] - en_supply_catalog.shape[0]}")
print(f"English books percentage: {(en_supply_catalog.shape[0] / bbe_clean.shape[0]) * 100:.1f}%")

# display sample
print("\nSample of English-only BBE dataset:")
display(en_supply_catalog.head(3))

Original BBE dataset shape: (52424, 36)
English-only BBE dataset shape: (42634, 36)
Books filtered out (non-English): 9790
English books percentage: 81.3%

Sample of English-only BBE dataset:


Unnamed: 0,goodreads_id_clean,authors_list,author_clean,title_clean,isbn_clean,language_clean,publication_date_clean,publisher_clean,is_major_publisher,bookFormat_clean,...,description_clean,description_nlp,series_clean,pages_clean,bbeVotes_clean,bbeScore_clean,likedPercent_clean,has_likedPercent,price_clean,price_flag
0,2767052,['suzanne collins'],suzanne collins,the hunger games,9780439023481.0,en,2008-09-14,scholastic,True,hardcover,...,winning means fame and fortunelosing means cer...,winning means fame and fortunelosing means cer...,the hunger games,374.0,30516,2993816,96.0,1,5.09,False
1,2,"['jk rowling', 'mary grandpre']","jk rowling, mary grandpre",harry potter and the order of the phoenix,9780439358071.0,en,2003-06-21,scholastic,True,paperback,...,there is a door at the end of a silent corrido...,there is a door at the end of a silent corrido...,harry potter,870.0,26923,2632233,98.0,1,7.38,False
2,2657,['harper lee'],harper lee,to kill a mockingbird,,en,2007-07-11,harpercollins,True,paperback,...,the unforgettable novel of a childhood in a sl...,the unforgettable novel of a childhood in a sl...,to kill a mockingbird,324.0,23328,2269402,95.0,1,,True


In [32]:

import numpy as np

# ensure en_supply_catalog is a copy to avoid SettingWithCopyWarning
en_supply_catalog = en_supply_catalog.copy()

# apply fix_missing_text and astype directly (not with .loc)
for col in ['description_clean', 'description_nlp']:
    if col in en_supply_catalog.columns:
        en_supply_catalog[col] = fix_missing_text(col)
        en_supply_catalog[col] = en_supply_catalog[col].astype("string")

# report missing text analysis for supply catalog
print("=== Fix Missing Text Analysis for Supply Catalog ===")
for col in ['description_clean', 'description_nlp']:
    if col in en_supply_catalog.columns:
        total = len(en_supply_catalog)
        missing = en_supply_catalog[col].isna().sum()
        empty = (en_supply_catalog[col] == '').sum()
        whitespace = en_supply_catalog[col].str.strip().eq('').sum()
        print(f"{col}:")
        print(f"  Total rows: {total}")
        print(f"  Missing (NaN): {missing} ({missing/total:.2%})")
        print(f"  Empty strings: {empty}")
        print(f"  Whitespace-only: {whitespace}")
        print("-" * 40)

# Add publication_decade to supply/BBE catalog
i# Add publication_year and publication_decade to en_supply_catalog

if 'publication_date_clean' in en_supply_catalog.columns:
    en_supply_catalog['publication_year'] = pd.to_datetime(
        en_supply_catalog['publication_date_clean'], errors='coerce'
    ).dt.year
    en_supply_catalog['publication_decade'] = (en_supply_catalog['publication_year'] // 10) * 10
    print("Added publication_year and publication_decade to en_supply_catalog.")
else:
    print("publication_date_clean column not found in en_supply_catalog.")

# Add is_overlap column
en_supply_catalog['is_overlap'] = en_supply_catalog['goodreads_id_clean'].isin(gb_enriched_en['goodreads_id_clean']).astype(bool)
print(f"Overlap books in supply: {en_supply_catalog['is_overlap'].sum()} / {len(en_supply_catalog)}")

=== Fix Missing Text Analysis for Supply Catalog ===
description_clean:
  Total rows: 42634
  Missing (NaN): 33879 (79.46%)
  Empty strings: 0
  Whitespace-only: 0
----------------------------------------
description_nlp:
  Total rows: 42634
  Missing (NaN): 33879 (79.46%)
  Empty strings: 0
  Whitespace-only: 0
----------------------------------------
Added publication_year and publication_decade to en_supply_catalog.
Overlap books in supply: 7834 / 42634


In [33]:
print("\n" + "=" * 60)
print("FINAL DATA QUALITY CHECKS - en_supply_catalog")
print("=" * 60)

# check for duplicates
print("\n1. DUPLICATE CHECK")
print(f"Total rows: {len(en_supply_catalog)}")
duplicates_bbe = en_supply_catalog.duplicated(subset=['goodreads_id_clean']).sum()
print(f"Duplicate goodreads_id_clean: {duplicates_bbe}")

# verify language filter
print("\n2. LANGUAGE CONSISTENCY CHECK")
unique_langs_bbe = en_supply_catalog['language_clean'].unique()
print(f"Unique languages: {unique_langs_bbe}")

# metadata coverage
print("\n3. METADATA COVERAGE")
for col, label in metadata_cols.items():
    if col in en_supply_catalog.columns:
        coverage = en_supply_catalog[col].notna().sum() / len(en_supply_catalog) * 100
        print(f"{label}: {coverage:.1f}% coverage")

print(f"\nFinal BBE dataset shape: {en_supply_catalog.shape}")

print("\n" + "=" * 60)


FINAL DATA QUALITY CHECKS - en_supply_catalog

1. DUPLICATE CHECK
Total rows: 42634
Duplicate goodreads_id_clean: 0

2. LANGUAGE CONSISTENCY CHECK
Unique languages: ['en']

3. METADATA COVERAGE
Pages: 96.3% coverage
Publication Date: 99.4% coverage
Publisher: 94.3% coverage
Genres: 100.0% coverage
Description: 20.5% coverage

Final BBE dataset shape: (42634, 39)



In [34]:
from pathlib import Path

# Create data folder if not exists
file_name = 'en_supply_catalog'
clean_path = Path("outputs/datasets/cleaned")
clean_path.mkdir(parents=True, exist_ok=True)

en_supply_catalog.to_csv(clean_path / f"{file_name}.csv", index=False)

print(f"{file_name} saved successfully in {clean_path} directory.")

en_supply_catalog saved successfully in outputs\datasets\cleaned directory.


# Dataset Merge Strategy
We merge Goodbooks (internal catalog + ratings) with BBE (external supply catalog) on goodreads_id_clean to create two modeling variants:

## Warm Start Dataset
Includes external BBE signals (ratings, votes, liked %) for **cross-platform validation**:

- Do books popular on BBE also perform well on Goodbooks?
- Useful for analyzing rating transfer patterns across platforms

## Cold Start Dataset
Excludes all external behavioral features to prevent leakage:

- Trains only on intrinsic book metadata (genre, author, publisher, etc.)
- Simulates **new book scenarios** where external platform data is unavailable
- Production-ready for fair model evaluation

In [35]:
# create renaming dictionaries for merging datasets
rename_gb = {
    'book_id': 'gb_book_id',
    'work_text_reviews_count': 'gb_work_text_reviews_count',
    'ratings_1': 'gb_ratings_1',
    'ratings_2': 'gb_ratings_2',
    'ratings_3': 'gb_ratings_3',
    'ratings_4': 'gb_ratings_4',
    'ratings_5': 'gb_ratings_5',

    # KEEP MERGE KEYS AS IS:
    # 'goodreads_id_clean': 'goodreads_id_clean',

    'best_book_id_clean': 'gb_best_book_id_clean',
    'work_id_clean': 'gb_work_id_clean',
    'authors_list': 'gb_authors_list',
    'author_clean': 'gb_author_clean',
    'language_clean': 'gb_language_clean',
    'publication_date_clean': 'gb_publication_date_clean',
    'isbn_clean': 'gb_isbn_clean',
    'rating_clean': 'gb_rating_clean',                  # TARGET
    'numRatings_clean': 'gb_numRatings_clean',
    'numRatings_log': 'gb_numRatings_log',
    'ratings_1_share': 'gb_ratings_1_share',
    'ratings_2_share': 'gb_ratings_2_share',
    'ratings_3_share': 'gb_ratings_3_share',
    'ratings_4_share': 'gb_ratings_4_share',
    'ratings_5_share': 'gb_ratings_5_share',
    'work_text_reviews_log': 'gb_work_text_reviews_log',
    'series_clean': 'gb_series_clean',
    'title_clean': 'gb_title_clean',
    'pages_clean': 'gb_pages_clean',
    'genres_clean': 'gb_genres_clean',
    'genres_simplified': 'gb_genres_simplified',
    'publisher_clean': 'gb_publisher_clean',
    'is_major_publisher': 'gb_is_major_publisher',
    'has_award': 'gb_has_award',
    'description_clean': 'gb_description_clean',
    'description_nlp': 'gb_description_nlp'
}

rename_bbe = {
    # shared key remains untouched
    # 'goodreads_id_clean': 'goodreads_id_clean'

    'authors_list': 'bbe_authors_list',
    'author_clean': 'bbe_author_clean',
    'title_clean': 'bbe_title_clean',
    'isbn_clean': 'bbe_isbn_clean',
    'language_clean': 'bbe_language_clean',
    'publication_date_clean': 'bbe_publication_date_clean',
    'publisher_clean': 'bbe_publisher_clean',
    'is_major_publisher': 'bbe_is_major_publisher',
    'bookFormat_clean': 'bbe_bookFormat_clean',

    'rating_clean': 'bbe_rating_clean',  # EXTERNAL RATING (predictive feature for model 1)
    'numRatings_clean': 'bbe_numRatings_clean',
    'numRatings_log': 'bbe_numRatings_log',

    'ratings_1': 'bbe_ratings_1',
    'ratings_2': 'bbe_ratings_2',
    'ratings_3': 'bbe_ratings_3',
    'ratings_4': 'bbe_ratings_4',
    'ratings_5': 'bbe_ratings_5',

    'ratings_1_share': 'bbe_ratings_1_share',
    'ratings_2_share': 'bbe_ratings_2_share',
    'ratings_3_share': 'bbe_ratings_3_share',
    'ratings_4_share': 'bbe_ratings_4_share',
    'ratings_5_share': 'bbe_ratings_5_share',

    'has_award': 'bbe_has_award',
    'genres_clean': 'bbe_genres_clean',
    'genres_simplified': 'bbe_genres_simplified',
    'description_clean': 'bbe_description_clean',
    'description_nlp': 'bbe_description_nlp',
    'series_clean': 'bbe_series_clean',
    'pages_clean': 'bbe_pages_clean',

    'bbeVotes_clean': 'bbe_votes_clean',
    'bbeScore_clean': 'bbe_score_clean',
    'likedPercent_clean': 'bbe_likedPercent_clean',
    'has_likedPercent': 'bbe_has_likedPercent',
    'price_clean': 'bbe_price_clean',
    'price_flag': 'bbe_price_flag'
}

# apply renaming
internal_catalog = gb_enriched_en.rename(columns=rename_gb)
supply = en_supply_catalog.rename(columns=rename_bbe)

set(internal_catalog.columns).intersection(set(supply.columns))
# should return {'goodreads_id_clean'}

{'goodreads_id_clean'}

In [36]:
# ensure merge key is string type
internal_catalog['goodreads_id_clean'] = internal_catalog['goodreads_id_clean'].astype(str)
supply['goodreads_id_clean'] = supply['goodreads_id_clean'].astype(str)
# perform left merge
merged = internal_catalog.merge(
    supply,
    on="goodreads_id_clean",
    how="left"
)
print(f"Merged dataset shape: {merged.shape}")
merged.head()

Merged dataset shape: (9761, 73)


Unnamed: 0,gb_book_id,gb_work_text_reviews_count,gb_ratings_1,gb_ratings_2,gb_ratings_3,gb_ratings_4,gb_ratings_5,goodreads_id_clean,gb_best_book_id_clean,gb_work_id_clean,...,bbe_pages_clean,bbe_votes_clean,bbe_score_clean,bbe_likedPercent_clean,bbe_has_likedPercent,bbe_price_clean,bbe_price_flag,publication_year,publication_decade,is_overlap
0,1,155254,66715,127936,560092,1481305,2706317,2767052,2767052,2792775,...,374.0,30516.0,2993816.0,96.0,1.0,5.09,False,2008.0,2000.0,True
1,2,75867,75504,101676,455024,1156318,3011543,3,3,4640799,...,309.0,7348.0,691430.0,96.0,1.0,,True,1997.0,1990.0,True
2,3,95009,456191,436802,793319,875073,1355439,41865,41865,3212258,...,501.0,14874.0,1459448.0,78.0,1.0,2.1,False,2005.0,2000.0,True
3,4,72586,60427,117415,446835,1001952,1714267,2657,2657,3275794,...,324.0,23328.0,2269402.0,95.0,1.0,,True,2007.0,2000.0,True
4,5,51992,86236,197621,606158,936012,947718,4671,4671,245494,...,200.0,8142.0,755074.0,90.0,1.0,,True,2004.0,2000.0,True


In [37]:
# CONSOLIDATION BLOCK: GB + BBE -> final columns

import pandas as pd
from src.modeling.feature_engineering import extract_primary_author

# 1) PUBLICATION_DATE as datetime
# gb_publication_date_clean is object (string), convert safely
merged['gb_publication_date_clean'] = pd.to_datetime(
    merged['gb_publication_date_clean'], errors='coerce'
)

# bbe_publication_date_clean is object (string), convert safely
merged['bbe_publication_date_clean'] = pd.to_datetime(
    merged['bbe_publication_date_clean'], errors='coerce'
)


# 2) Consolidate TITLE
merged['title_final'] = merged['gb_title_clean'].combine_first(
    merged['bbe_title_clean']
)

# 3) Consolidate AUTHOR (single author)
merged['gb_primary_author'] = merged['gb_author_clean'].apply(extract_primary_author)
merged['bbe_primary_author'] = merged['bbe_author_clean'].apply(extract_primary_author)


merged['author_final'] = merged['gb_primary_author'].combine_first(
    merged['bbe_primary_author']
)
# Standardize author_final to lowercase and strip whitespace
merged['author_final'] = merged['author_final'].str.lower().str.strip()


# 4) Consolidate AUTHORS LIST (multi-author)
merged['authors_list_final'] = merged['gb_authors_list'].combine_first(
    merged['bbe_authors_list']
)


# 5) Consolidate LANGUAGE
merged['language_final'] = merged['gb_language_clean'].combine_first(
    merged['bbe_language_clean']
)


# 6) Consolidate PUBLICATION DATE + YEAR + DECADE
merged['publication_date_final'] = merged['gb_publication_date_clean'].combine_first(
    merged['bbe_publication_date_clean']
)

# Derived fields:
merged['publication_year'] = merged['publication_date_final'].dt.year
merged['publication_decade'] = (merged['publication_year'] // 10) * 10


# 7) Consolidate ISBN
merged['isbn_final'] = merged['gb_isbn_clean'].combine_first(
    merged['bbe_isbn_clean']
)


# 8) Consolidate SERIES
merged['series_final'] = merged['gb_series_clean'].combine_first(
    merged['bbe_series_clean']
)


# 9) Consolidate PUBLISHER
merged['publisher_final'] = merged['gb_publisher_clean'].combine_first(
    merged['bbe_publisher_clean']
)


# 10) Consolidate PAGES
# gb_pages_clean: float
# bbe_pages_clean: float
merged['pages_final'] = merged['gb_pages_clean'].combine_first(
    merged['bbe_pages_clean']
)


# 11) Consolidate GENRES
# objects holding lists or strings
merged['genres_final'] = merged['gb_genres_clean'].combine_first(
    merged['bbe_genres_clean']
)

merged['genres_simple_final'] = merged['gb_genres_simplified'].combine_first(
    merged['bbe_genres_simplified']
)


# 12) Consolidate DESCRIPTION
# gb_description_clean: pandas string dtype
# bbe_description_clean: object dtype
merged['description_final'] = merged['gb_description_clean'].combine_first(
    merged['bbe_description_clean']
)

merged['description_nlp_final'] = merged['gb_description_nlp'].combine_first(
    merged['bbe_description_nlp']
)


# 13) Consolidate AWARDS
# gb_has_award is bool, bbe_has_award is object ("True"/"False"/None)
# normalize  award info
merged['gb_has_award'] = (
    merged['gb_has_award']
        .replace({'True': True, 'False': False, '': None})
        .astype('boolean')
)

merged['bbe_has_award'] = (
    merged['bbe_has_award']
        .replace({'True': True, 'False': False, '': None})
        .astype('boolean')
)

# consolidate
merged['has_award_final'] = merged['gb_has_award'].combine_first(
    merged['bbe_has_award']
)

# fill any remaining NA with False (no award)
merged['has_award_final'] = (
    merged['has_award_final']
        .fillna(False)
        .astype('boolean')
)

# 14) Consolidate MAJOR PUBLISHER FLAG
# gb_is_major_publisher: object ("True"/"False")
# bbe_is_major_publisher: object
merged['gb_is_major_publisher'] = merged['gb_is_major_publisher'].replace({
    'True': True, 'False': False
}).astype('boolean')

merged['bbe_is_major_publisher'] = merged['bbe_is_major_publisher'].replace({
    'True': True, 'False': False
}).astype('boolean')

merged['is_major_publisher_final'] = merged['gb_is_major_publisher'].combine_first(
    merged['bbe_is_major_publisher']
)


# 15) Consolidate PRICE (BBE-only)
merged['external_price'] = merged['bbe_price_clean']
merged['price_flag_final'] = merged['bbe_price_flag']


# 16) Consolidate BOOK FORMAT (BBE-only)
merged['external_bookformat'] = merged['bbe_bookFormat_clean']

# 17). Consolidate EXTERNAL RATINGS (BBE-only)
merged['external_rating'] = merged['bbe_rating_clean']
merged['external_numratings'] = merged['bbe_numRatings_clean']
merged['external_votes'] = merged['bbe_votes_clean']
merged['external_score'] = merged['bbe_score_clean']
merged['external_likedpct'] = merged['bbe_likedPercent_clean']


# 18) Consolidate EXTERNAL RATING DISTRIBUTIONS (BBE-only)
bbe_rating_dist_cols = [
    'bbe_ratings_1', 'bbe_ratings_2', 'bbe_ratings_3',
    'bbe_ratings_4', 'bbe_ratings_5',
    'bbe_ratings_1_share', 'bbe_ratings_2_share',
    'bbe_ratings_3_share', 'bbe_ratings_4_share',
    'bbe_ratings_5_share'
]

for col in bbe_rating_dist_cols:
    merged[f'external_{col}'] = merged[col]

print("Consolidation complete.")
print("Final columns added:", [col for col in merged.columns if col.endswith('_final')])
print("Final dataset shape:", merged.shape)
print("Dataset columns:")
print(merged.columns.tolist())

Consolidation complete.
Final columns added: ['title_final', 'author_final', 'authors_list_final', 'language_final', 'publication_date_final', 'isbn_final', 'series_final', 'publisher_final', 'pages_final', 'genres_final', 'genres_simple_final', 'description_final', 'description_nlp_final', 'has_award_final', 'is_major_publisher_final', 'price_flag_final']
Final dataset shape: (9761, 108)
Dataset columns:
['gb_book_id', 'gb_work_text_reviews_count', 'gb_ratings_1', 'gb_ratings_2', 'gb_ratings_3', 'gb_ratings_4', 'gb_ratings_5', 'goodreads_id_clean', 'gb_best_book_id_clean', 'gb_work_id_clean', 'gb_authors_list', 'gb_author_clean', 'gb_language_clean', 'gb_publication_date_clean', 'gb_isbn_clean', 'isbn_standard', 'gb_rating_clean', 'gb_numRatings_clean', 'gb_numRatings_log', 'gb_ratings_1_share', 'gb_ratings_2_share', 'gb_ratings_3_share', 'gb_ratings_4_share', 'gb_ratings_5_share', 'gb_work_text_reviews_log', 'gb_series_clean', 'gb_title_clean', 'gb_pages_clean', 'gb_genres_clean', 'g

In [38]:
cols_to_drop_warm = [

    # INTERNAL LEAKAGE FIELDS
    'gb_ratings_1', 'gb_ratings_2', 'gb_ratings_3',
    'gb_ratings_4', 'gb_ratings_5',
    'gb_ratings_1_share', 'gb_ratings_2_share',
    'gb_ratings_3_share', 'gb_ratings_4_share',
    'gb_ratings_5_share',
    'gb_work_text_reviews_log',
    'gb_work_text_reviews_count',

    # REDUNDANT CONSOLIDATION INPUTS
    'gb_title_clean', 'bbe_title_clean',
    'gb_primary_author', 'bbe_primary_author',
    'gb_author_clean', 'bbe_author_clean',
    'gb_authors_list', 'bbe_authors_list',
    'gb_language_clean', 'bbe_language_clean',
    'gb_publication_date_clean', 'bbe_publication_date_clean',
    'gb_series_clean', 'bbe_series_clean',
    'gb_publisher_clean', 'bbe_publisher_clean',
    'gb_pages_clean', 'bbe_pages_clean',
    'gb_genres_clean', 'bbe_genres_clean',
    'gb_genres_simplified', 'bbe_genres_simplified',
    'gb_description_clean', 'bbe_description_clean',
    'gb_description_nlp', 'bbe_description_nlp',
    'gb_is_major_publisher', 'bbe_is_major_publisher',
    'gb_has_award', 'bbe_has_award',
    'gb_isbn_clean', 'bbe_isbn_clean',
    
    # REDUNDANT BBE FLAGS AND FIELDS
    'bbe_price_clean',
    'bbe_has_likedPercent',
    'bbe_bookFormat_clean',

    # IDENTIFIERS
    'gb_best_book_id_clean',
    'gb_work_id_clean',
    'isbn_final',

    # NON-PREDICTIVE CLEANING ARTIFACTS
    'price_flag_final',
    'bbe_price_flag', 

    # DUPLICATES OF external_* FEATURES → drop original BBE fields
    'bbe_rating_clean',
    'bbe_numRatings_clean',
    'bbe_votes_clean',
    'bbe_score_clean',
    'bbe_likedPercent_clean',
    'bbe_numRatings_log',
    
    # Drop original BBE rating distribution columns (now have external_* versions)
    'bbe_ratings_1', 'bbe_ratings_2', 'bbe_ratings_3',
    'bbe_ratings_4', 'bbe_ratings_5',
    'bbe_ratings_1_share', 'bbe_ratings_2_share',
    'bbe_ratings_3_share', 'bbe_ratings_4_share',
    'bbe_ratings_5_share',
]


In [39]:
cols_to_drop_cold = cols_to_drop_warm + [

    # REMOVE engineered external_* features
    *[col for col in merged.columns if col.startswith('external_')],
]

In [40]:
merged_clean_warm = merged.drop(columns=cols_to_drop_warm, errors='ignore')
print("Dropped", len(cols_to_drop_warm), "columns for WARM START dataset.")
print("Final WARM START dataset shape:", merged_clean_warm.shape)

merged_clean_cold = merged.drop(columns=cols_to_drop_cold, errors='ignore')
print("Dropped", len(cols_to_drop_cold), "columns for COLD START dataset.")
print("Final COLD START dataset shape:", merged_clean_cold.shape)

print("Columns in WARM START but not in COLD START:")
print(set(merged_clean_warm.columns) - set(merged_clean_cold.columns))

Dropped 68 columns for WARM START dataset.
Final WARM START dataset shape: (9761, 40)
Dropped 85 columns for COLD START dataset.
Final COLD START dataset shape: (9761, 23)
Columns in WARM START but not in COLD START:
{'external_rating', 'external_score', 'external_bbe_ratings_3', 'external_bbe_ratings_1_share', 'external_bbe_ratings_2_share', 'external_numratings', 'external_votes', 'external_bookformat', 'external_bbe_ratings_5_share', 'external_price', 'external_bbe_ratings_5', 'external_bbe_ratings_3_share', 'external_bbe_ratings_2', 'external_likedpct', 'external_bbe_ratings_4', 'external_bbe_ratings_1', 'external_bbe_ratings_4_share'}


## Finalizing and Saving Datasets

Both datasets use consolidated `_final` columns (title, author, pages, etc.) where Goodbooks takes precedence and BBE fills gaps.

In [41]:
from pathlib import Path

# create output directory
output_path = Path("outputs/datasets/cleaned")
output_path.mkdir(parents=True, exist_ok=True)

# save WARM START dataset (includes external BBE signals)
warm_file = output_path / "model_dataset_warm_start.csv"
merged_clean_warm.to_csv(warm_file, index=False)
print(f"WARM START dataset saved: {warm_file}")

# save COLD START dataset (no external signals)
cold_file = output_path / "model_dataset_cold_start.csv"
merged_clean_cold.to_csv(cold_file, index=False)
print(f"\nCOLD START dataset saved: {cold_file}")

# rating clean
ratings_clean.to_csv(output_path / "ratings_clean.csv", index=False)
print(f"\nRatings clean dataset saved: {output_path / 'ratings_clean.csv'}")

# summary
print("\nDATASETS SAVED SUCCESSFULLY")
print("=" * 60)
print(f"\nWARM START: {merged_clean_warm.shape[0]} books, {merged_clean_warm.shape[1]} features")
print(f"COLD START: {merged_clean_cold.shape[0]} books, {merged_clean_cold.shape[1]} features")
print(f"\nFeature difference: {merged_clean_warm.shape[1] - merged_clean_cold.shape[1]} external signals")
print(f"\nRatings clean dataset: {ratings_clean.shape[0]} entries, {ratings_clean.shape[1]} features")

WARM START dataset saved: outputs\datasets\cleaned\model_dataset_warm_start.csv

COLD START dataset saved: outputs\datasets\cleaned\model_dataset_cold_start.csv

Ratings clean dataset saved: outputs\datasets\cleaned\ratings_clean.csv

DATASETS SAVED SUCCESSFULLY

WARM START: 9761 books, 40 features
COLD START: 9761 books, 23 features

Feature difference: 17 external signals

Ratings clean dataset: 5976479 entries, 3 features


# Conclusion

This notebook successfully completed **multi-source data enrichment and dataset integration** for book satisfaction modeling.

### Key Results

**Enrichment Performance:**
- Multi-source strategy: BBE overlap -> OpenLibrary API -> Google Books API
- **95.1%** page count coverage, **100%** publication dates, **99.4%** valid languages
- **94.7%** publisher coverage, **85.2%** description coverage
- **90.7%** genre coverage (from 0% in original Goodbooks)

**Final Datasets:**

| Dataset | Location | Purpose | Records |
|---------|----------|---------|---------|
| `en_supply_catalog.csv` | `outputs/datasets/cleaned/` | BBE supply catalog | 47,452 |
| `en_internal_catalog.csv` | `outputs/datasets/cleaned/` | Enriched Goodbooks | 9,940 |
| `model_dataset_warm_start.csv` | `outputs/datasets/modeling/` | With external signals | 9,940 |
| `model_dataset_cold_start.csv` | `outputs/datasets/modeling/` | Without external signals | 9,940 |

### Next Steps

- **Notebook 04:** Exploratory analysis and genre diversity comparison