# Data Enrichment & Dataset Integration

## Objectives

The purpose of this notebook is to **enrich, align, and integrate the cleaned datasets** to create a unified analytical foundation for modelling book satisfaction and evaluating catalogue diversity.

This notebook expands upon prior cleaning work by **adding missing metadata, linking overlapping records across datasets, filtering the dataset to English-language titles, and preparing a model-ready dataset** that combines catalog-level information (BBE) with user-behavioral data (Goodbooks).

Ultimately, this notebook enables insights that neither dataset could provide independently, most critically, **genre diversity analysis**, **language-based consistency**, **metadata-enhanced prediction modeling**.

---

## Inputs

| Dataset                             | Source                     | Description                                                                                         | Format |
| ----------------------------------- | -------------------------- | --------------------------------------------------------------------------------------------------- | ------ |
| `bbe_clean_v13.csv`                  | Output from Notebook 02    | Cleaned *Best Books Ever* metadata including title, authors, genres, rating, description, and more. | CSV    |
| `books_clean_v7.csv`      | Output from Notebook 02    | Cleaned Goodbooks-10k metadata lacking genre data but containing structural identifiers.            | CSV    |
| `ratings_clean_v1.csv`    | Output from Notebook 02    | User–book interaction and aggregated rating data for behavioral modeling.                           | CSV    |
| *(Optional)* External API responses | OpenLibrary / Google Books | Supplemental metadata (genres, languages, subjects) for non-overlapping titles.                     | JSON   |

---

## Tasks in This Notebook

This notebook will execute the following enrichment and integration steps:

1. **Standardize linking identifiers**
   Normalize `isbn_clean`, `goodreads_id`, `title_clean`, and `author_clean` across datasets to ensure reliable cross-dataset merging.

2. **Identify overlap between BBE and Goodbooks**
   Detect books present in both datasets using multi-key matching and evaluate match quality.

3. **Enrich Goodbooks metadata with missing genres**

   * Use BBE genre fields for overlapping titles.
   * Query external APIs for non-overlapping titles.
   * Normalize all genre outputs into a unified taxonomy.

4. **Complete and standardize language metadata**
   Fill missing values using BBE, APIs, or text-based heuristics, then harmonize language labels and codes.

5. **Filter the enriched datasets to English-language books**
   Restrict the unified dataset to titles identified as **English-language**, ensuring consistency for:

   * genre diversity comparisons
   * ratings behavior
   * regression modeling

   *(Non-English titles will be kept only in the enriched BBE/Goodbooks outputs, but excluded from the model dataset.)*

6. **Integrate datasets into a unified model-ready schema**
   Combine BBE metadata with Goodbooks behavioral features for all overlapping **English-language** books.

7. **Validate enrichment and filtering results**

   * Assess genre and language fill rates
   * Review API match and success metrics
   * Log all imputation and filtering decisions for reproducibility

8. **Export enriched and unified datasets**
   Produce final English-filtered datasets ready for modeling and analysis.

---

## Outputs

* **BBE_clean_enriched.csv** — enriched metadata for all BBE books
* **Goodbooks_books_clean_enriched.csv** — enriched metadata for all Goodbooks books
* **model_dataset_overlap_en_only.csv** — unified metadata + behavioral dataset filtered to English-language books
* **Enrichment and filtering logs** — documenting imputation sources, API usage, and filtering decisions

> **Note:** This notebook focuses on **metadata enrichment, English-language filtering, and dataset integration**. Model development and feature engineering will be performed in later notebooks.

# Set up

## Navigate to the Parent Directory

Before combining and saving datasets, it’s often helpful to move to a parent directory so that file operations (like loading or saving data) are easier and more organized. 

Before using the Python’s built-in os module to move one level up from the current working directory, it is advisable to inspect the current directory.

In [1]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

Current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics\notebooks


To change to parent directory (root folder), run the code below. If you are already in the root folder, you can skip this step.

In [2]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

Changed directory to parent.
New current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics


In [16]:
from src.cleaning.utils.categories import (
    map_subjects_to_genres
)
from src.cleaning.utils.pipeline import apply_cleaners_selectively

In [None]:
# Install additional packages for this notebook
! pip install requests python-dotenv tqdm

## Load and Inspect Datasets

In this step, we load the previously cleaned datasets: **Goodbooks-10k** (books, ratings) and **Best Books Ever**. 

In [4]:
import pandas as pd 

# load datasets
books_clean = pd.read_csv(
    'data/interim/goodbooks/books_clean_v7.csv',
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
    )
ratings_clean = pd.read_csv('data/interim/goodbooks/ratings_clean_v1.csv')
bbe_clean = pd.read_csv(
    "data/interim/bbe/bbe_clean_v13.csv",
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
)

# create copies for imputation
books_impute = books_clean.copy()
ratings_impute = ratings_clean.copy()
bbe_impute = bbe_clean.copy()

# log samples
print("BBE dataset columns:")
print(bbe_impute.columns.tolist())
print("BBE dataset info:")
display(bbe_impute.info())
print("BBE dataset sample:")
display(bbe_impute.head(3))

print("Books dataset columns:")
print(books_impute.columns.tolist())
print("Books dataset info:")
display(books_impute.info())
print("Books dataset sample:")
display(books_impute.head(3))

print("Ratings dataset columns:")
print(ratings_impute.columns.tolist())
print("Ratings dataset info:")
display(ratings_impute.info())
print("Ratings dataset sample:")
display(ratings_impute.head(3))

BBE dataset columns:
['goodreads_id_clean', 'authors_list', 'author_clean', 'title_clean', 'isbn_clean', 'language_clean', 'publication_date_clean', 'publisher_clean', 'is_major_publisher', 'bookFormat_clean', 'rating_clean', 'numRatings_clean', 'numRatings_log', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'ratings_1_share', 'ratings_2_share', 'ratings_3_share', 'ratings_4_share', 'ratings_5_share', 'has_award', 'genres_clean', 'genres_simplified', 'description_clean', 'description_nlp', 'series_clean', 'pages_clean', 'bbeVotes_clean', 'bbeScore_clean', 'likedPercent_clean', 'has_likedPercent', 'price_clean', 'price_flag']
BBE dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52424 entries, 0 to 52423
Data columns (total 36 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   goodreads_id_clean      52424 non-null  string 
 1   authors_list            52424 non-null  object 
 2   auth

None

BBE dataset sample:


Unnamed: 0,goodreads_id_clean,authors_list,author_clean,title_clean,isbn_clean,language_clean,publication_date_clean,publisher_clean,is_major_publisher,bookFormat_clean,...,description_clean,description_nlp,series_clean,pages_clean,bbeVotes_clean,bbeScore_clean,likedPercent_clean,has_likedPercent,price_clean,price_flag
0,2767052,['suzanne collins'],suzanne collins,the hunger games,9780439023481.0,en,2008-09-14,scholastic,True,hardcover,...,winning means fame and fortunelosing means cer...,winning means fame and fortunelosing means cer...,the hunger games,374.0,30516,2993816,96.0,1,5.09,False
1,2,"['jk rowling', 'mary grandpre']","jk rowling, mary grandpre",harry potter and the order of the phoenix,9780439358071.0,en,2003-06-21,scholastic,True,paperback,...,there is a door at the end of a silent corrido...,there is a door at the end of a silent corrido...,harry potter,870.0,26923,2632233,98.0,1,7.38,False
2,2657,['harper lee'],harper lee,to kill a mockingbird,,en,,harpercollins,True,paperback,...,the unforgettable novel of a childhood in a sl...,the unforgettable novel of a childhood in a sl...,to kill a mockingbird,324.0,23328,2269402,95.0,1,,True


Books dataset columns:
['book_id', 'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'goodreads_id_clean', 'best_book_id_clean', 'work_id_clean', 'authors_list', 'author_clean', 'language_clean', 'publication_date_clean', 'isbn_clean', 'isbn13_clean', 'rating_clean', 'numRatings_clean', 'numRatings_log', 'ratings_1_share', 'ratings_2_share', 'ratings_3_share', 'ratings_4_share', 'ratings_5_share', 'work_text_reviews_log', 'series_clean', 'title_clean']
Books dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   book_id                  10000 non-null  int64  
 1   work_text_reviews_count  10000 non-null  int64  
 2   ratings_1                10000 non-null  int64  
 3   ratings_2                10000 non-null  int64  
 4   ratings_3                10000 non-null  int

None

Books dataset sample:


Unnamed: 0,book_id,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,goodreads_id_clean,best_book_id_clean,work_id_clean,...,numRatings_clean,numRatings_log,ratings_1_share,ratings_2_share,ratings_3_share,ratings_4_share,ratings_5_share,work_text_reviews_log,series_clean,title_clean
0,1,155254,66715,127936,560092,1481305,2706317,2767052,2767052,2792775,...,4942365,15.413355,0.013499,0.025886,0.113325,0.299716,0.547575,11.952824,the hunger games,the hunger games
1,2,75867,75504,101676,455024,1156318,3011543,3,3,4640799,...,4800065,15.38414,0.01573,0.021182,0.094795,0.240896,0.627396,11.23675,harry potter,harry potter and the sorcerer's stone
2,3,95009,456191,436802,793319,875073,1355439,41865,41865,3212258,...,3916824,15.180792,0.11647,0.111519,0.202541,0.223414,0.346056,11.461737,twilight,twilight


Ratings dataset columns:
['user_id', 'book_id', 'rating']
Ratings dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5976479 entries, 0 to 5976478
Data columns (total 3 columns):
 #   Column   Dtype
---  ------   -----
 0   user_id  int64
 1   book_id  int64
 2   rating   int64
dtypes: int64(3)
memory usage: 136.8 MB


None

Ratings dataset sample:


Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5


# Data Enrichment

## Enriching Goodbooks with Genre and Page Count

### From BBE overlap

To improve the completeness and quality of the Goodbooks-10k dataset, we selectively merge in metadata from the Best Books Ever (BBE) dataset using the shared `goodreads_id_clean` key. Goodbooks is kept as the primary source, while BBE is used to supply additional metadata fields, such as genres and page counts, as well as to fill in missing values for shared attributes like ISBN, publication date, and series.

This approach ensures we enhance Goodbooks only where necessary: adding new information where it is absent and completing incomplete entries without overwriting existing data. The resulting `gb_enriched` dataset combines both sources into a more reliable and feature-rich foundation for downstream analytics and modeling.


In [5]:
# ---------------------------------------------
# ENRICH GOODBOOKS (books_impute) WITH BBE DATA
# ---------------------------------------------

import pandas as pd

# columns to enrich ONLY when GB has NaN
columns_to_enrich = [
    "publication_date_clean",
    "series_clean",
    "isbn_clean",
    "language_clean"
    ]

# columns existent only in BBE
bbe_only_columns = [
    "pages_clean",
    "genres_clean",
    "genres_simplified",
    "publisher_clean",
    "is_major_publisher",
    "has_award",
    "description_clean",
    "description_nlp"
]

# merge Goodbooks with the needed BBE columns
merge_cols = ["goodreads_id_clean"] + columns_to_enrich + bbe_only_columns

gb_enriched = books_impute.merge(
    bbe_impute[merge_cols].add_suffix("_bbe"),
    left_on="goodreads_id_clean",
    right_on="goodreads_id_clean_bbe",
    how="left"
)

# ---------------------------------------------
# ENRICH GENRE COLUMNS
# ---------------------------------------------
print("\n--- ENRICHING METADATA ---")
for col in bbe_only_columns:
    gb_enriched[col] = gb_enriched[col + "_bbe"]
    filled = gb_enriched[col].notna().sum()
    print(f"{col}: filled {filled} rows from BBE")

# ---------------------------------------------
# ENRICH SHARED COLUMNS ONLY WHERE GB IS NaN
# ---------------------------------------------
print("\n--- ENRICHING SHARED COLUMNS (GB NaN -> fill from BBE) ---")
for col in columns_to_enrich:
    before = gb_enriched[col].isna().sum()
    gb_enriched[col] = gb_enriched[col].fillna(gb_enriched[col + "_bbe"])
    after = gb_enriched[col].isna().sum()
    print(f"{col}: filled {before - after} missing values")

# ---------------------------------------------
# CLEANUP
# ---------------------------------------------
gb_enriched = gb_enriched.drop(columns=[c for c in gb_enriched.columns if c.endswith("_bbe")])

print("\nEnrichment complete!")
print("Final shape:", gb_enriched.shape)
gb_enriched[['isbn_clean','title_clean', 'series_clean', 'genres_clean', 'genres_simplified', 'pages_clean', 'publication_date_clean']].head()


--- ENRICHING METADATA ---
pages_clean: filled 8053 rows from BBE
genres_clean: filled 8082 rows from BBE
genres_simplified: filled 8082 rows from BBE
publisher_clean: filled 7954 rows from BBE
is_major_publisher: filled 8082 rows from BBE
has_award: filled 8082 rows from BBE
description_clean: filled 8009 rows from BBE
description_nlp: filled 8009 rows from BBE

--- ENRICHING SHARED COLUMNS (GB NaN -> fill from BBE) ---
publication_date_clean: filled 68 missing values
series_clean: filled 1133 missing values
isbn_clean: filled 984 missing values
language_clean: filled 684 missing values

Enrichment complete!
Final shape: (10000, 35)


Unnamed: 0,isbn_clean,title_clean,series_clean,genres_clean,genres_simplified,pages_clean,publication_date_clean
0,439023483.0,the hunger games,the hunger games,"['young adult', 'fiction', 'dystopia', 'fantas...","['young adult', 'fiction', 'dystopia', 'fantas...",374.0,2008-01-01
1,439554934.0,harry potter and the sorcerer's stone,harry potter,"['fantasy', 'fiction', 'young adult', 'magic',...","['fantasy', 'fiction', 'young adult', 'magic',...",309.0,1997-01-01
2,316015849.0,twilight,twilight,"['young adult', 'fantasy', 'romance', 'vampire...","['young adult', 'fantasy', 'romance', 'vampire...",501.0,2005-01-01
3,,to kill a mockingbird,to kill a mockingbird,"['classics', 'fiction', 'historical fiction', ...","['classics', 'fiction', 'historical fiction', ...",324.0,1960-01-01
4,743273567.0,the great gatsby,,"['classics', 'fiction', 'school', 'literature'...","['classics', 'fiction', 'school', 'literature'...",200.0,1925-01-01


In [6]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/cleaned/merge")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 1

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v1 saved successfully in data/interim/merge directory.


### From external APIs

To further enrich the Goodbooks-10k dataset, we leverage external APIs such as OpenLibrary and Google Books to fill in missing metadata for titles not covered by the BBE overlap. This process involves querying these APIs using available identifiers (like ISBN or title/author combinations) to retrieve additional information such as genres, page counts, and publication details.

In [7]:
import re

def clean_isbn(isbn):
    if not isinstance(isbn, str):
        return None
    isbn = re.sub(r'[^0-9Xx]', '', isbn)
    if len(isbn) in [10, 13]:
        return isbn
    return None

gb_enriched['isbn_query'] = gb_enriched['isbn_clean'].apply(clean_isbn)

In [9]:
missing_mask = (
    gb_enriched['language_clean'].isna() |
    gb_enriched['language_clean'].isin(['unknown', '', 'None']) |
    gb_enriched['pages_clean'].isna() |
    gb_enriched['publication_date_clean'].isna()  |
    gb_enriched['publisher_clean'].isna() |
    gb_enriched['description_clean'].isna()
)

to_impute = gb_enriched[missing_mask].copy()
print("Books needing external enrichment:", len(to_impute))

Books needing external enrichment: 2249


#### Querying OpenLibrary API

After enriching Goodbooks with BBE overlap data, we identify **2,249** books still missing critical metadata (ISBN, language, pages, publication date, publisher). We query **OpenLibrary first** because it has no rate limits or API key requirements, making it ideal for bulk enrichment. We create a boolean mask to identify books needing enrichment, then query OpenLibrary's ISBN endpoint for each book, collecting results in a structured format.

The results are merged back into `gb_enriched` and saved as **version 2**. This incremental saving strategy ensures we don't lose progress if subsequent API calls fail or exceed quotas.

In [10]:
import json
from pathlib import Path

# cache path for OpenLibrary in data/raw
OL_CACHE_PATH = Path("data/raw/openlibrary_api_cache.json")

# create directory if it doesn't exist
OL_CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)

# load existing cache if it exists
if OL_CACHE_PATH.exists():
    with open(OL_CACHE_PATH, "r") as f:
        ol_cache = json.load(f)
    print(f"Loaded {len(ol_cache)} cached OpenLibrary entries")
else:
    ol_cache = {}
    print("No existing cache found, starting fresh")

No existing cache found, starting fresh


In [11]:
import requests
import time

def query_openlibrary(isbn):
    """Return OL metadata in a consistent dict format."""

    isbn_str = str(isbn)
    
    if isbn_str in ol_cache:
        return ol_cache[isbn_str]
    
    # Default structure to guarantee stable DataFrame columns
    result = {
        "pages_openlib": None,
        "publication_date_openlib": None,
        "language_openlib": None,
        "subjects_openlib": None,
        "publisher_openlib": None, 
        "description_openlib": None, 
    }

    if isbn is None or pd.isna(isbn) or isbn == "":
        return result
    
    url = f"https://openlibrary.org/isbn/{isbn}.json"

    try:
        r = requests.get(url, timeout=10)
        time.sleep(0.2)

        if r.status_code != 200:
            return result

        data = r.json()

        # Pages
        result["pages_openlib"] = data.get("number_of_pages")

        # Publication date
        result["publication_date_openlib"] = data.get("publish_date")

        # Language
        if "languages" in data and isinstance(data["languages"], list):
            key = data["languages"][0].get("key", "").split("/")[-1]
            result["language_openlib"] = key

        # Subjects
        if "subjects" in data:
            result["subjects_openlib"] = [s.lower() for s in data["subjects"]]
        
        # Publisher
        if "publishers" in data and isinstance(data["publishers"], list):
            result["publisher_openlib"] = data["publishers"][0]
        
        # Description
        desc = data.get("description")
        if isinstance(desc, dict):
            result["description_openlib"] = desc.get("value")
        elif isinstance(desc, str):
            result["description_openlib"] = desc


    except Exception as e:
        pass  # keep the default result structure

    # Save to cache
    ol_cache[isbn_str] = result
    return result


In [12]:
import time
from tqdm import tqdm

results = []
for isbn in tqdm(to_impute['isbn_query'], desc="Querying OpenLibrary"):
    results.append(query_openlibrary(isbn))
    time.sleep(0.2)   # safe rate limit

Querying OpenLibrary: 100%|██████████| 2249/2249 [42:47<00:00,  1.14s/it] 


In [None]:
import json
from pathlib import Path

# Save OpenLibrary cache after queries
with open(OL_CACHE_PATH, "w") as f:
    json.dump(ol_cache, f, indent=2)
print(f"OpenLibrary cache saved with {len(ol_cache)} entries")

# convert results to dataframe
ol_df = pd.DataFrame(results, index=to_impute.index)
print("API results summary:")
print(ol_df.notna().sum())

# merge back into gb_enriched
for col in ol_df.columns:
    if col not in gb_enriched.columns:
        gb_enriched[col] = None
    gb_enriched.loc[ol_df.index, col] = ol_df[col]

# verify the merge
print("\nAfter merge:")
print(gb_enriched[ol_df.columns].notna().sum())

API results summary:
pages_openlib               1399
publication_date_openlib    1737
language_openlib            1472
subjects_openlib            1046
publisher_openlib           1689
description_openlib          526
dtype: int64

After merge:
pages_openlib               1399
publication_date_openlib    1737
language_openlib            1472
subjects_openlib            1046
publisher_openlib           1689
description_openlib          526
dtype: int64


In [14]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/cleaned/merge")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 2

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v2 saved successfully in data/interim/merge directory.


#### Cleaning and Processing OpenLibrary Data

We apply the same cleaning steps used in Notebook 02, compiled into a pipeline, to standardize OpenLibrary API responses. The `apply_cleaners_selectively()` function ensures consistent data types, formats, and validation across all metadata fields. After cleaning, we fill missing values in `gb_enriched` using the cleaned OpenLibrary data.

For genre enrichment, we map OpenLibrary subjects to our standardized genre taxonomy using `map_subjects_to_genres()`. This populates `genres_simplified` for books that had subjects but no genre data, significantly improving genre coverage. The enriched dataset is saved as **version 3**.

In [18]:
# clean OpenLibrary API data
gb_enriched = apply_cleaners_selectively(
    gb_enriched,
    fields_to_clean=[
        'pages',
        'publication_date',
        'language',
        'subjects',
        'publisher',
        'description'
        ],
    source_suffix='_openlib',
    target_suffix='_openlib_clean',
    inplace=False
)

# verify cleaning
print("\nSample of cleaned OpenLibrary data:")
gb_enriched[[
    'title_clean',
    'pages_clean',
    'pages_openlib',
    'pages_openlib_clean',
    'publication_date_clean',
    'publication_date_openlib',
    'publication_date_openlib_clean',
    'language_clean',
    'language_openlib',
    'language_openlib_clean',
    'genres_clean',
    'genres_simplified',
    'subjects_openlib',
    'subjects_openlib_clean',
    'publisher_clean',
    'description_openlib',
    'description_clean',
    'description_openlib',
    'description_openlib_clean'
    ]].sample(15, random_state=42)


Sample of cleaned OpenLibrary data:


Unnamed: 0,title_clean,pages_clean,pages_openlib,pages_openlib_clean,publication_date_clean,publication_date_openlib,publication_date_openlib_clean,language_clean,language_openlib,language_openlib_clean,genres_clean,genres_simplified,subjects_openlib,subjects_openlib_clean,publisher_clean,description_openlib,description_clean,description_openlib.1,description_openlib_clean
6252,scion of ikshvaku,354.0,,,2015-01-01,,,en,,,"['mythology', 'fiction', 'fantasy', 'indian li...","['mythology', 'fiction', 'fantasy', 'other', '...",,,westland publication,,ram rajya the perfect land but perfection has ...,,
4684,canada,420.0,,,2012-01-01,,,en,,,"['fiction', 'canada', 'literary fiction', 'con...","['fiction', 'other', 'literary fiction', 'cont...",,,harpercollins,,first i'll tell about the robbery our parents ...,,
1731,the man in the brown suit,381.0,,,1924-01-01,,,en,,,"['mystery', 'fiction', 'crime', 'classics', 'm...","['mystery', 'fiction', 'crime', 'classics', 'm...",,,harpercollins,,newly-orphaned anne beddingfeld is a nice engl...,,
4742,twilight and philosophy vampires vegetarians a...,259.0,,,2009-01-01,,,en,,,"['philosophy', 'nonfiction', 'vampires', 'essa...","['philosophy', 'nonfiction', 'vampires', 'essa...",,,wiley,,the first look at the philosophy behind stephe...,,
4521,saga vol 5,,152.0,152.0,2015-01-01,"September 15, 2015",2015-09-15,en,eng,en,,,"[military deserters, parents of exceptional ch...","[military deserters, parents of exceptional ch...",,,,,
6340,asterix the gaul,48.0,,,1960-01-01,,,en,,,"['comics', 'graphic novels', 'bande dessine', ...","['comics', 'graphic novels', 'other', 'fiction...",,,"orion books ltd, london",,the year is 50 bc and all gaul is occupied onl...,,
576,tuck everlasting,148.0,,,1975-01-01,,,en,,,"['fantasy', 'young adult', 'classics', 'fictio...","['fantasy', 'young adult', 'classics', 'fictio...",,,macmillan,,doomed to - or blessed with - eternal life aft...,,
5202,domes of fire,,480.0,480.0,1992-01-01,"May 29, 1993",1993-05-29,,eng,en,,,"[fiction - fantasy, fiction, fantasy, fantasy ...","[fiction - fantasy, fiction, fantasy, fantasy ...",,,,,
6363,when we were orphans,,320.0,320.0,2000-01-01,"March 3, 2005",2005-03-03,en,,,,,"[modern fiction, fiction]","[modern fiction, fiction]",,,,,
439,fall of giants,985.0,,,2010-01-01,,,en,,,"['historical fiction', 'fiction', 'historical'...","['historical fiction', 'fiction', 'historical'...",,,penguin random house,,this is an epic of love hatred war and revolut...,,


In [19]:
# fill missing values with cleaned OpenLibrary data
print("\n--- Filling missing values with cleaned OpenLibrary data ---")

# fill pages_clean
before_pages = gb_enriched['pages_clean'].isna().sum()
gb_enriched['pages_clean'] = gb_enriched['pages_clean'].fillna(gb_enriched['pages_openlib_clean'])
after_pages = gb_enriched['pages_clean'].isna().sum()
print(f"pages_clean: filled {before_pages - after_pages} values")

# fill publication_date_clean
before_date = gb_enriched['publication_date_clean'].isna().sum()
gb_enriched['publication_date_clean'] = gb_enriched['publication_date_clean'].fillna(gb_enriched['publication_date_openlib_clean'])
after_date = gb_enriched['publication_date_clean'].isna().sum()
print(f"publication_date_clean: filled {before_date - after_date} values")

# fill language_clean
# Create mask that catches both NaN and invalid string values
before_lang = (gb_enriched['language_clean'].isna() | 
               gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

mask = (gb_enriched['language_clean'].isna() | 
        gb_enriched['language_clean'].isin(['unknown', '', 'None']))

gb_enriched.loc[mask, 'language_clean'] = gb_enriched.loc[mask, 'language_openlib_clean']

after_lang = (gb_enriched['language_clean'].isna() | 
              gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

print(f"language_clean: filled {before_lang - after_lang} values")

# fill publisher_clean
print("\n--- Filling missing publisher_clean using OpenLibrary data ---")
before_publisher = gb_enriched['publisher_clean'].isna().sum()
gb_enriched['publisher_clean'] = gb_enriched['publisher_clean'].fillna(
    gb_enriched['publisher_openlib_clean']
)
after_publisher = gb_enriched['publisher_clean'].isna().sum()
print(f"publisher_clean: filled {before_publisher - after_publisher} values")

# fill description_clean
print("\n--- Filling missing description_clean using OpenLibrary data ---")

before_desc = gb_enriched['description_clean'].isna().sum()
gb_enriched['description_clean'] = gb_enriched['description_clean'].fillna(
    gb_enriched['description_openlib_clean']
)
after_desc = gb_enriched['description_clean'].isna().sum()
print(f"description_clean: filled {before_desc - after_desc} values")

# generate genres_simplified from subjects_openlib_clean
print("\n--- Generating genres_simplified from OpenLibrary subjects ---")

# Import genre mapping utilities

# Fill genres_simplified for books that have subjects but no genres
books_needing_genre_mapping = (
    (gb_enriched['genres_simplified'].isna()) & 
    (gb_enriched['subjects_openlib_clean'].notna())
)

print(f"Books with subjects but no genres_simplified: {books_needing_genre_mapping.sum()}")

if books_needing_genre_mapping.sum() > 0:
    # Apply genre mapping to subjects
    gb_enriched.loc[books_needing_genre_mapping, 'genres_simplified'] = (
        gb_enriched.loc[books_needing_genre_mapping, 'subjects_openlib_clean']
        .apply(lambda x: map_subjects_to_genres(x) if isinstance(x, list) else None)
    )
    
    filled_genres = (
        gb_enriched.loc[books_needing_genre_mapping, 'genres_simplified'].notna().sum()
    )
    print(f"genres_simplified: mapped {filled_genres} values from OpenLibrary subjects")
    
    # Show sample of newly mapped genres
    newly_mapped = gb_enriched[books_needing_genre_mapping & gb_enriched['genres_simplified'].notna()]
    if len(newly_mapped) > 0:
        print("\nSample of newly mapped genres:")
        print(newly_mapped[['title_clean', 'subjects_openlib_clean', 'genres_simplified']].head(5))



--- Filling missing values with cleaned OpenLibrary data ---
pages_clean: filled 1257 values
publication_date_clean: filled 42 values
language_clean: filled 305 values

--- Filling missing publisher_clean using OpenLibrary data ---
publisher_clean: filled 1515 values

--- Filling missing description_clean using OpenLibrary data ---
description_clean: filled 507 values

--- Generating genres_simplified from OpenLibrary subjects ---
Books with subjects but no genres_simplified: 951
genres_simplified: mapped 796 values from OpenLibrary subjects

Sample of newly mapped genres:
            title_clean                             subjects_openlib_clean  \
29            gone girl  [fiction suspense, fiction mystery detective g...   
32  memoirs of a geisha  [geishas -- fiction, women -- japan -- fiction...   
43         the notebook                          [modern fiction, fiction]   
47       fahrenheit 451  [bradbury ray - prose criticism, spanishcontem...   
70         frankenstein  [fra

In [20]:
print("\n--- ENRICHMENT SUMMARY ---")

# Show books that received OpenLibrary data
books_with_ol_data = gb_enriched[
    gb_enriched['subjects_openlib_clean'].notna()
].copy()

print(f"\nTotal books enriched with OpenLibrary data: {len(books_with_ol_data)}")

if len(books_with_ol_data) > 0:
    print("\nSample of books enriched from OpenLibrary:")
    display(books_with_ol_data[[
        'title_clean',
        'author_clean',
        'pages_openlib_clean',
        'publication_date_openlib_clean',
        'language_openlib_clean',
        'subjects_openlib_clean',
        'genres_simplified',
        'publisher_clean',
        'description_clean'
    ]].head(10))

# Show overall genre coverage
print(f"\n--- GENRE COVERAGE AFTER ENRICHMENT ---")
print(f"Books with genres_clean: {gb_enriched['genres_clean'].notna().sum()}")
print(f"Books without genres_clean: {gb_enriched['genres_clean'].isna().sum()}")
print(f"Genre coverage: {gb_enriched['genres_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%\n")
print(f"Books with genres_simplified: {gb_enriched['genres_simplified'].notna().sum()}")
print(f"Books without genres_simplified: {gb_enriched['genres_simplified'].isna().sum()}")
print(f"Genre simplified coverage: {gb_enriched['genres_simplified'].notna().sum() / len(gb_enriched) * 100:.1f}%")
# Show overall description and publisher coverage
print(f"Books with description_clean: {gb_enriched['description_clean'].notna().sum()}")
print(f"Books without description_clean: {gb_enriched['description_clean'].isna().sum()}")
print(f"Books description coverage: {gb_enriched['description_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%")
print(f"Books with publisher_clean: {gb_enriched['publisher_clean'].notna().sum()}")
print(f"Books without publisher_clean: {gb_enriched['publisher_clean'].isna().sum()}")
print(f"Books publisher coverage: {gb_enriched['publisher_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%")


--- ENRICHMENT SUMMARY ---

Total books enriched with OpenLibrary data: 1046

Sample of books enriched from OpenLibrary:


Unnamed: 0,title_clean,author_clean,pages_openlib_clean,publication_date_openlib_clean,language_openlib_clean,subjects_openlib_clean,genres_simplified,publisher_clean,description_clean
29,gone girl,gillian flynn,399.0,2012-01-01,en,"[fiction suspense, fiction mystery detective g...","[fiction, mystery]",weidenfeld nicolson,just how well can you ever know the person you...
32,memoirs of a geisha,arthur golden,758.0,2005-01-01,en,"[geishas -- fiction, women -- japan -- fiction...","[fiction, historical fiction]",penguin random house,
43,the notebook,nicholas sparks,272.0,2004-07-05,,"[modern fiction, fiction]",[fiction],penguin random house,
47,fahrenheit 451,ray bradbury,176.0,2006-01-03,es,"[bradbury ray - prose criticism, spanishcontem...","[fiction, science fiction, non-fiction]",plaza y janes,
70,frankenstein,"mary wollstonecraft shelley, percy bysshe shel...",273.0,2003-01-01,en,"[frankenstein-- fiction, scientists -- fiction...",[fiction],penguin random house,presents the story of dr frankenstein and his ...
83,jurassic park,michael crichton,467.0,2006-01-01,es,"[suspense, fiction, fiction - general, spanish...",[fiction],debolsillo,
146,thirteen reasons why,jay asher,288.0,2008-01-01,en,"[suicide -- fiction, high schools -- fiction, ...",[fiction],razorbill,when high school student clay jenkins receives...
166,american gods,neil gaiman,672.0,2002-03-04,en,[science fiction],"[fiction, science fiction]",headline book publishing,
173,the shack,william paul young,252.0,2007-01-01,en,"[life change events -- fiction, missing childr...","[fiction, children]",windblown media,mackenzie allen phillips' youngest daughter mi...
194,the guernsey literary and potato peel pie society,"mary ann shaffer, annie barrows",288.0,2008-01-01,en,"[literary, fiction literary, fiction, fiction ...",[fiction],the dial press,i wonder how the book got to guernsey perhaps ...



--- GENRE COVERAGE AFTER ENRICHMENT ---
Books with genres_clean: 8082
Books without genres_clean: 1918
Genre coverage: 80.8%

Books with genres_simplified: 8878
Books without genres_simplified: 1122
Genre simplified coverage: 88.8%
Books with description_clean: 8516
Books without description_clean: 1484
Books description coverage: 85.2%
Books with publisher_clean: 9469
Books without publisher_clean: 531
Books publisher coverage: 94.7%


In [21]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/cleaned/merge")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 3

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v3 saved successfully in data/interim/merge directory.


#### Querying Google Books API (with Quota Management)

After OpenLibrary enrichment, we create a new mask to identify remaining gaps: **1,730** books. Google Books API requires an API key and has daily quota limits (1,000 requests/day for free tier), so we implement several strategies: **(1)** process ISBNs in chunks of 1,000, **(2)** add sleep delays between requests, **(3)** cache all results to avoid re-querying, and **(4)** save progress incrementally.

We load existing cache if available, query only uncached ISBNs, and update the cache after each session. This approach allows us to spread queries across multiple days if needed while preserving all previous work.

In [22]:
# Check how many books still need enrichment
new_missing_mask = (
    gb_enriched['language_clean'].isna() |
    gb_enriched['language_clean'].isin(['unknown', '', 'None']) |
    gb_enriched['pages_clean'].isna() |
    gb_enriched['publication_date_clean'].isna() | 
    gb_enriched['publisher_clean'].isna() |
    gb_enriched['description_clean'].isna()
)

new_to_impute = gb_enriched[new_missing_mask].copy()
print("Books still needing external enrichment:", len(new_to_impute))

# Show breakdown by field
print("\nBreakdown of remaining missing values:")
print(f"  - Missing pages: {gb_enriched['pages_clean'].isna().sum()}")
print(f"  - Missing publication_date: {gb_enriched['publication_date_clean'].isna().sum()}")
print(f"  - Missing/invalid language: {(gb_enriched['language_clean'].isna() | gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()}")
print(f"  - Missing publisher: {gb_enriched['publisher_clean'].isna().sum()}")
print(f"  - Missing description: {gb_enriched['description_clean'].isna().sum()}")  

Books still needing external enrichment: 1730

Breakdown of remaining missing values:
  - Missing pages: 690
  - Missing publication_date: 3
  - Missing/invalid language: 95
  - Missing publisher: 531
  - Missing description: 1484


In [23]:
def chunk_list(data, size=1000):
    for i in range(0, len(data), size):
        yield data[i:i+size]

In [24]:
import json
from pathlib import Path

# Define cache path for Google Books in data/raw
CACHE_PATH = Path("data/raw/google_api_cache.json")

# Create directory if it doesn't exist
CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)

# Load existing cache if it exists
if CACHE_PATH.exists():
    with open(CACHE_PATH, "r") as f:
        google_cache = json.load(f)
else:
    google_cache = {}

In [25]:
import os
import time
import requests
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_BOOKS_API_KEY")

def query_google_books(isbn):
    isbn = str(isbn)

    # Check cache
    if isbn in google_cache:
        return google_cache[isbn]

    # Query Google Books with API key
    url = (
        f"https://www.googleapis.com/books/v1/volumes?"
        f"q=isbn:{isbn}&key={GOOGLE_API_KEY}"
    )

    r = requests.get(url)

    if r.status_code != 200:
        result = {"isbn": isbn, "error": f"HTTP {r.status_code}"}
    else:
        data = r.json()
        if "items" in data and data["items"]:
            volume = data["items"][0]["volumeInfo"]
            result = {
                "isbn": isbn,
                "title": volume.get("title"),
                "authors": volume.get("authors"),
                "publisher": volume.get("publisher"),
                "publishedDate": volume.get("publishedDate"),
                "pageCount": volume.get("pageCount"),
                "categories": volume.get("categories"),
                "language": volume.get("language"),
                "description": volume.get("description"),
            }
        else:
            result = {"isbn": isbn, "error": "No results"}

    # Save to cache
    google_cache[isbn] = result
    return result

In [None]:
list_of_isbns = (
    new_to_impute['isbn_query']
    .dropna()
    .astype(str)
    .str.strip()
    .unique()
    .tolist()
)
print(f"Unique ISBNs to query: {len(list_of_isbns)}")
chunks = list(chunk_list(list_of_isbns, size=1000))


Unique ISBNs to query: 1278


In [32]:
from tqdm import tqdm

# Choose which chunk you want to process today
chunk_to_process = chunks[1]   # run chunk 0 today, 1 tomorrow, etc.
results = []
for isbn in tqdm(chunk_to_process, desc="Querying Google Books"):
    results.append(query_google_books(isbn))
    time.sleep(0.1)   # be nice to the API


Querying Google Books: 100%|██████████| 278/278 [02:44<00:00,  1.69it/s]


In [33]:
# Save cache after ISBN queries
with open(CACHE_PATH, "w") as f:
    json.dump(google_cache, f, indent=2)
print(f"Cache updated with {len(google_cache)} entries after ISBN queries")

Cache updated with 1730 entries after ISBN queries


#### Handling Books Without ISBNs

Some books lack valid ISBNs but can still be enriched using **title and author search**. Google Books API supports `intitle:` and `inauthor:` query parameters, allowing us to find books by bibliographic metadata instead of identifiers. We create cache keys in `"title|author"` format to distinguish these from ISBN-based queries.

This fallback strategy significantly increases our enrichment coverage, especially for older books, special editions, or records with ISBN errors. Results are cached alongside ISBN queries to maintain a unified enrichment workflow.

In [30]:

def query_google_books_by_title(title, author):
    """Query Google Books API using title and author when ISBN is unavailable."""
    
    # Create cache key
    cache_key = f"{title}|{author}"
    
    if cache_key in google_cache:
        return google_cache[cache_key]
    
    # Build query string
    query_parts = []
    if pd.notna(title):
        query_parts.append(f'intitle:"{title}"')
    if pd.notna(author):
        query_parts.append(f'inauthor:"{author}"')
    
    query_string = "+".join(query_parts)
    
    url = (
        f"https://www.googleapis.com/books/v1/volumes?"
        f"q={query_string}&key={GOOGLE_API_KEY}"
    )
    
    r = requests.get(url)
    
    if r.status_code != 200:
        result = {"title": title, "author": author, "error": f"HTTP {r.status_code}"}
    else:
        data = r.json()
        if "items" in data and data["items"]:
            volume = data["items"][0]["volumeInfo"]
            result = {
                "title": title,
                "author": author,
                "pageCount": volume.get("pageCount"),
                "publisher": volume.get("publisher"),  
                "publishedDate": volume.get("publishedDate"),
                "categories": volume.get("categories"),
                "language": volume.get("language"),
                "description": volume.get("description"),
            }
        else:
            result = {"title": title, "author": author, "error": "No results"}
    
    google_cache[cache_key] = result
    return result

# Process books without ISBN separately
books_without_isbn = new_to_impute[new_to_impute['isbn_query'].isna()].copy()
print(f"Books without ISBN to query by title/author: {len(books_without_isbn)}")

results_by_title = []
for idx, row in tqdm(books_without_isbn.iterrows(), 
                     total=len(books_without_isbn),
                     desc="Querying Google Books by title/author"):
    results_by_title.append(query_google_books_by_title(
        row['title_clean'], 
        row['author_clean']
    ))
    time.sleep(0.1)

Books without ISBN to query by title/author: 452


Querying Google Books by title/author: 100%|██████████| 452/452 [05:41<00:00,  1.32it/s]


In [31]:
# Save cache after title/author queries
with open(CACHE_PATH, "w") as f:
    json.dump(google_cache, f, indent=2)
print(f"Cache updated with {len(google_cache)} entries after title/author queries")

Cache updated with 1452 entries after title/author queries


#### Loading and Applying Cached Results

The Google Books cache contains results from multiple query sessions, potentially across different days. We load the complete cache and separate ISBN-based results from title/author-based results by checking for the `"|"` delimiter in cache keys. This allows us to apply different matching logic for each result type.

We then merge cached data back into `gb_enriched`, apply the cleaning pipeline to standardize formats, and fill remaining metadata gaps. The `map_subjects_to_genres()` function maps Google Books categories to our genre taxonomy, further increasing genre coverage. This completes our multi-source enrichment strategy.

In [34]:
import json
import pandas as pd
from pathlib import Path

# Load the cache
CACHE_PATH = Path("google_api_cache.json")

with open(CACHE_PATH, "r") as f:
    google_cache = json.load(f)

print(f"Loaded {len(google_cache)} cached entries")

# Separate ISBN-based results from title/author-based results
isbn_results = []
title_author_results = []

for key, value in google_cache.items():
    if "|" in key:  # Title|Author format
        title_author_results.append(value)
    else:  # ISBN format
        isbn_results.append(value)

print(f"Found {len(isbn_results)} ISBN-based results")
print(f"Found {len(title_author_results)} title/author-based results")

# add google books data to dataframe
google_columns = [
    'pageCount_google',
    'publishedDate_google',
    'categories_google',
    'language_google',
    'publisher_google',
    'description_google'
]
for col in google_columns:
    if col not in gb_enriched.columns:
        gb_enriched[col] = None

Loaded 1864 cached entries
Found 1099 ISBN-based results
Found 765 title/author-based results


In [35]:
# merge isbn_results back to gb_enriched
if isbn_results:
    google_isbn_df = pd.DataFrame(isbn_results)
    print("\nISBN results preview:")
    print(google_isbn_df.head())
    
    # Create ISBN mapping
    isbn_to_data = {str(row['isbn']): row for _, row in google_isbn_df.iterrows() 
                    if 'isbn' in row and pd.notna(row.get('isbn'))}
    
    # Update gb_enriched
    books_with_isbn = new_to_impute[new_to_impute['isbn_query'].notna()].copy()
    
    for idx in books_with_isbn.index:
        isbn = str(books_with_isbn.loc[idx, 'isbn_query'])
        if isbn in isbn_to_data:
            result = isbn_to_data[isbn]
            if pd.isna(result.get('error')):
                gb_enriched.loc[idx, 'pageCount_google'] = result.get('pageCount')
                gb_enriched.loc[idx, 'publishedDate_google'] = result.get('publishedDate')
                gb_enriched.loc[idx, 'categories_google'] = result.get('categories')
                gb_enriched.loc[idx, 'language_google'] = result.get('language')
                gb_enriched.loc[idx, 'publisher_google'] = result.get('publisher')
                gb_enriched.loc[idx, 'description_google'] = result.get('description')
    
    print(f"✓ Merged ISBN-based results for {len(isbn_to_data)} books")


# merge title/author results back to gb_enriched

if title_author_results:
    google_title_df = pd.DataFrame(title_author_results)
    print("\nTitle/Author results preview:")
    print(google_title_df.head())
    
    # Recreate books_without_isbn from new_to_impute
    books_without_isbn = new_to_impute[new_to_impute['isbn_query'].isna()].copy()
    
    # Create title|author key mapping
    for i, (idx, row) in enumerate(books_without_isbn.iterrows()):
        if i < len(google_title_df):
            title_author_key = f"{row['title_clean']}|{row['author_clean']}"
            if title_author_key in google_cache:
                result = google_cache[title_author_key]
                if pd.isna(result.get('error')):
                    gb_enriched.loc[idx, 'pageCount_google'] = result.get('pageCount')
                    gb_enriched.loc[idx, 'publishedDate_google'] = result.get('publishedDate')
                    gb_enriched.loc[idx, 'categories_google'] = result.get('categories')
                    gb_enriched.loc[idx, 'language_google'] = result.get('language')
                    gb_enriched.loc[idx, 'publisher_google'] = result.get('publisher')
                    gb_enriched.loc[idx, 'description_google'] = result.get('description')
    
    print(f"✓ Merged title/author-based results for {len(books_without_isbn)} books")

# Verify merge
print("\nGoogle Books data merged:")
for col in google_columns:
    count = gb_enriched[col].notna().sum()
    print(f"  - {col}: {count} values")

# clean Google Books API data
from src.cleaning.utils.pipeline import apply_cleaners_selectively

gb_enriched = apply_cleaners_selectively(
    gb_enriched,
    fields_to_clean=[
        'pageCount',
        'publishedDate',
        'language',
        'categories',
        'publisher',
        'description'
        ],
    source_suffix='_google',
    target_suffix='_google_clean',
    inplace=False
)

# Verify cleaning
print("\nSample of cleaned Google Books data:")
display(gb_enriched[[
    'title_clean',
    'pages_clean',
    'pageCount_google',
    'pageCount_google_clean',
    'publication_date_clean',
    'publishedDate_google',
    'publishedDate_google_clean',
    'language_clean',
    'language_google',
    'language_google_clean',
    'genres_clean',
    'categories_google',
    'categories_google_clean',
    'genres_simplified',
    'publisher_google',
    'publisher_clean',
    'description_google',
    'description_clean',
]].dropna(subset=['pageCount_google_clean', 'language_google_clean'], how='all').sample(min(15, len(gb_enriched)), random_state=42))


ISBN results preview:
            isbn                      title                      authors  \
0  9780451524940  A Midsummer Night's Dream        [William Shakespeare]   
1  9780452284240                Animal Farm              [George Orwell]   
2  9780618346260      The Lord of the Rings  [John Ronald Reuel Tolkien]   
3  9780297859380                        NaN                          NaN   
4  9780061122420        The Monk Downstairs             [Tim Farrington]   

         publisher publishedDate  pageCount  \
0  Signet Classics          1987      212.0   
1          Penguin    2003-05-06      129.0   
2             None          1994     1137.0   
3              NaN           NaN        NaN   
4        HarperOne    2006-05-23        0.0   

                                categories language       error  
0                                  [Drama]       en         NaN  
1                                [Fiction]       en         NaN  
2  [Baggins, Frodo (Fictitious characte

Unnamed: 0,title_clean,pages_clean,pageCount_google,pageCount_google_clean,publication_date_clean,publishedDate_google,publishedDate_google_clean,language_clean,language_google,language_google_clean,genres_clean,categories_google,categories_google_clean,genres_simplified,publisher_google,publisher_clean,description_google,description_clean
4672,the scarecrow,,285.0,285.0,2009-01-01,2014-05-22,2014-05-22,en,en,en,,[Fiction],[fiction],,,,,
1528,the witch of portobello,,33.0,33.0,2006-01-01,2011-04-28,2011-04-28,en,en,en,,[Fiction],[fiction],,,,,
701,the hundred-year-old man who climbed out of th...,,396.0,396.0,2009-01-01,2012,2012-01-01,en,en,en,,[Fiction],[fiction],,Hesperus Press,hesperus press ltd,,
8589,this body of death,,620.0,620.0,2010-01-01,2011-05-26,2011-05-26,en,en,en,,[Fiction],[fiction],,,,,
2242,the good neighbor,,,,2015-01-01,2013,2013-01-01,en,en,en,,,,,,,,
5273,a cook's tour global adventures in extreme cui...,,290.0,290.0,2001-01-01,2002-11-05,2002-11-05,en,en,en,,[Cooking],[cooking],,,,,
2748,i too had a love story,,0.0,,2007-01-01,2009,2009-01-01,en,en,en,,,,,,srishti publishers distributors,,
3739,panda bear panda bear what do you see,,36.0,36.0,2003-01-01,2003-08,2003-08-01,en,en,en,,[Juvenile Fiction],[juvenile fiction],"[fiction, young adult, children]",Macmillan,h holt,,illustrations and rhyming text present ten dif...
7431,the perfect son,,0.0,,2015-01-01,2015-07,2015-07-01,en,en,en,,[Children with disabilities],[children with disabilities],,,,,
3764,anne mccaffrey's dragonflight 1,,134.0,134.0,1991-01-01,1993,1993-01-01,en,en,en,,"[Fantasy comic books, strips, etc]",[fantasy comic books strips etc],,,eclipse books,,


In [36]:
# after cleaning Google Books data, we'll add genre mapping
print("\n--- Generating genres_simplified from Google Books categories ---")

books_needing_google_genre_mapping = (
    (gb_enriched['genres_simplified'].isna()) & 
    (gb_enriched['categories_google_clean'].notna())
)

print(f"Books with Google categories but no genres_simplified: {books_needing_google_genre_mapping.sum()}")

if books_needing_google_genre_mapping.sum() > 0:
    gb_enriched.loc[books_needing_google_genre_mapping, 'genres_simplified'] = (
        gb_enriched.loc[books_needing_google_genre_mapping, 'categories_google_clean']
        .apply(lambda x: map_subjects_to_genres(x) if isinstance(x, list) else None)
    )
    
    filled_genres = (
        gb_enriched.loc[books_needing_google_genre_mapping, 'genres_simplified'].notna().sum()
    )
    print(f"genres_simplified: mapped {filled_genres} values from Google Books categories")


# fill missing values with cleaned Google Books data
print("\n--- Filling missing values with cleaned Google Books data ---")

# Fill pages_clean
print("\n--- Filling remaining page_clean using Google Books data ---")
before_pages = gb_enriched['pages_clean'].isna().sum()
gb_enriched['pages_clean'] = gb_enriched['pages_clean'].fillna(gb_enriched['pageCount_google_clean'])
after_pages = gb_enriched['pages_clean'].isna().sum()
print(f"pages_clean: filled {before_pages - after_pages} values from Google Books")

# Fill publication_date_clean
print("\n--- Filling remaining publication_date_clean using Google Books data ---")
before_date = gb_enriched['publication_date_clean'].isna().sum()
gb_enriched['publication_date_clean'] = gb_enriched['publication_date_clean'].fillna(gb_enriched['publishedDate_google_clean'])
after_date = gb_enriched['publication_date_clean'].isna().sum()
print(f"publication_date_clean: filled {before_date - after_date} values from Google Books")

# Fill language_clean
print("\n--- Filling remaining language_clean using Google Books data ---")
before_lang = (gb_enriched['language_clean'].isna() | 
               gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()
mask = (gb_enriched['language_clean'].isna() | 
        gb_enriched['language_clean'].isin(['unknown', '', 'None']))
gb_enriched.loc[mask, 'language_clean'] = gb_enriched.loc[mask, 'language_google_clean']
after_lang = (gb_enriched['language_clean'].isna() | 
              gb_enriched['language_clean'].isin(['unknown', '', 'None'])).sum()

print(f"language_clean: filled {before_lang - after_lang} values from Google Books")

# Fill publisher_clean
print("\n--- Filling remaining publisher_clean using Google Books data ---")
before_publisher = gb_enriched['publisher_clean'].isna().sum()
gb_enriched['publisher_clean'] = gb_enriched['publisher_clean'].fillna(
    gb_enriched['publisher_google_clean']
)
after_publisher = gb_enriched['publisher_clean'].isna().sum()
print(f"publisher_clean: filled {before_publisher - after_publisher} values from Google Books")


# Fill description_clean
print("\n--- Filling remaining description_clean using Google Books data ---")
before_desc_google = gb_enriched['description_clean'].isna().sum()
gb_enriched['description_clean'] = gb_enriched['description_clean'].fillna(
    gb_enriched['description_google_clean']
)
after_desc_google = gb_enriched['description_clean'].isna().sum()

print(f"description_clean (Google): filled {before_desc_google - after_desc_google} values")

# final enrichment summary
print("\n--- FINAL ENRICHMENT SUMMARY (ALL SOURCES) ---")

print(f"\nTotal books enriched with Google Books data: {gb_enriched[gb_enriched['categories_google_clean'].notna()].shape[0]}")

print("\n--- FINAL METADATA COVERAGE ---")
print(f"Books with pages_clean: {gb_enriched['pages_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['pages_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with publication_date_clean: {gb_enriched['publication_date_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['publication_date_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with publisher_clean: {gb_enriched['publisher_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['publisher_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with description_clean: {gb_enriched['description_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['description_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
valid_language = gb_enriched['language_clean'].notna() & ~gb_enriched['language_clean'].isin(['unknown', '', 'None'])
print(f"Books with valid language_clean: {valid_language.sum()} / {len(gb_enriched)} ({valid_language.sum() / len(gb_enriched) * 100:.1f}%)")

print(f"\n--- FINAL GENRE COVERAGE ---")
print(f"Books with genres_clean: {gb_enriched['genres_clean'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['genres_clean'].notna().sum() / len(gb_enriched) * 100:.1f}%)")
print(f"Books with genres_simplified: {gb_enriched['genres_simplified'].notna().sum()} / {len(gb_enriched)} ({gb_enriched['genres_simplified'].notna().sum() / len(gb_enriched) * 100:.1f}%)")


--- Generating genres_simplified from Google Books categories ---
Books with Google categories but no genres_simplified: 257
genres_simplified: mapped 194 values from Google Books categories

--- Filling missing values with cleaned Google Books data ---

--- Filling remaining page_clean using Google Books data ---
pages_clean: filled 202 values from Google Books

--- Filling remaining publication_date_clean using Google Books data ---
publication_date_clean: filled 0 values from Google Books

--- Filling remaining language_clean using Google Books data ---
language_clean: filled 32 values from Google Books

--- Filling remaining publisher_clean using Google Books data ---
publisher_clean: filled 4 values from Google Books

--- Filling remaining description_clean using Google Books data ---
description_clean (Google): filled 0 values

--- FINAL ENRICHMENT SUMMARY (ALL SOURCES) ---

Total books enriched with Google Books data: 341

--- FINAL METADATA COVERAGE ---
Books with pages_clean:

In [37]:
from pathlib import Path

# Create data folder if not exists
file_name = 'gb_enriched'
clean_merge_path = Path("data/cleaned/merge")
clean_merge_path.mkdir(parents=True, exist_ok=True)

version = 4

gb_enriched.to_csv(clean_merge_path / f"{file_name}_v{version}.csv", index=False)

print(f"{file_name} v{version} saved successfully in data/interim/merge directory.")

gb_enriched v4 saved successfully in data/interim/merge directory.


#### Final Enrichment Summary

Our multi-source enrichment strategy (BBE → OpenLibrary → Google Books) achieved excellent metadata coverage: **95.1%** for page counts, **100%** for publication dates, **99.4%** for valid language codes, **94.7%** publishers and **85.2%** descriptions. Genre coverage reached **80.8%** for `genres_clean` and **90.7** for `genres_simplified`, a significant improvement from the original Goodbooks dataset which lacked genre information entirely.

This enriched dataset now provides a comprehensive foundation for modeling and analysis. The combination of catalog metadata from BBE, behavioral data from Goodbooks ratings, and API-sourced supplemental information creates a unified dataset that supports both predictive modeling and catalog diversity analysis. The next step is filtering to English-language titles and preparing the final model-ready dataset.