# Data Enrichment & Dataset Integration

## Objectives

The purpose of this notebook is to **enrich, align, and integrate the cleaned datasets** to create a unified analytical foundation for modeling book satisfaction and evaluating catalog diversity.

This notebook expands upon prior cleaning work by **adding missing metadata, linking overlapping records across datasets, filtering the dataset to English-language titles, and preparing a model-ready dataset** that combines catalog-level information (BBE) with user-behavioral data (Goodbooks).

Ultimately, this notebook enables insights that neither dataset could provide on its own, most critically **genre diversity analysis**, **language-based consistency**, and **metadata-enhanced predictive modeling**.

---

## Inputs

| Dataset                             | Source                     | Description                                                                                         | Format |
| ----------------------------------- | -------------------------- | --------------------------------------------------------------------------------------------------- | ------ |
| `bbe_clean_v13.csv`                 | Output from Notebook 02    | Cleaned *Best Books Ever* metadata including title, authors, genres, rating, description, and more. | CSV    |
| `books_clean_v7.csv`                | Output from Notebook 02    | Cleaned Goodbooks-10k metadata lacking genre data but containing structural identifiers.            | CSV    |
| `ratings_clean_v1.csv`              | Output from Notebook 02    | User–book interaction and aggregated rating data for behavioral modeling.                           | CSV    |
| *(Optional)* External API responses | OpenLibrary / Google Books | Supplemental metadata (genres, languages, subjects) for non-overlapping titles.                     | JSON   |

---

## Tasks in This Notebook

This notebook will execute the following enrichment and integration steps:

1. **Standardize linking identifiers**
   Normalize `isbn_clean`, `goodreads_id_clean`, `title_clean`, and `author_clean` across datasets to ensure reliable cross-dataset merging.

2. **Identify overlap between BBE and Goodbooks**
   Detect books present in both datasets using multi-key matching and evaluate match quality.

3. **Enrich Goodbooks metadata with missing genres**

   * Use BBE genre fields for overlapping titles.
   * Query external APIs for non-overlapping titles.
   * Normalize all genre outputs into a unified taxonomy.

4. **Complete and standardize language metadata**
   Fill missing values using BBE, APIs, or text-based heuristics, then harmonize language labels and codes.

5. **Filter the enriched datasets to English-language books**
   Restrict the unified dataset to titles identified as **English-language**, ensuring consistency for:

   * genre diversity comparisons
   * ratings behavior
   * regression modeling

   *(Non-English titles remain in the enriched BBE and Goodbooks outputs but are excluded from the model dataset.)*

6. **Integrate datasets into a unified model-ready schema**
   Combine BBE metadata with Goodbooks behavioral features for all overlapping **English-language** books.

7. **Validate enrichment and filtering results**

   * Assess genre and language fill rates
   * Review API match and success metrics
   * Log all imputation and filtering decisions for reproducibility

8. **Export enriched and unified datasets**
   Produce final English-filtered datasets ready for modeling and analysis.
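
As an illustration of step 1, the sketch below shows one way linking keys could be normalized before matching. It is not the cleaning logic from Notebook 02: the helper names `normalize_isbn` and `normalize_text_key`, the toy data, and the `isbn_key` / `title_key` columns are assumptions made for this example only.

```python
import re
import pandas as pd

def normalize_isbn(value):
    """Keep only digits and the X check character; return NA unless 10 or 13 characters remain."""
    if pd.isna(value):
        return pd.NA
    cleaned = re.sub(r"[^0-9Xx]", "", str(value)).upper()
    return cleaned if len(cleaned) in (10, 13) else pd.NA

def normalize_text_key(value):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical strings compare equal."""
    if pd.isna(value):
        return pd.NA
    text = re.sub(r"[^\w\s]", "", str(value).lower())
    return re.sub(r"\s+", " ", text).strip()

# Toy records (hypothetical), not taken from either dataset
demo = pd.DataFrame({
    "isbn": ["0-439-02348-3", "978 0439023481", None],
    "title": ["The Hunger Games!", "  the   hunger games ", "Dune"],
})
demo["isbn_key"] = demo["isbn"].map(normalize_isbn).astype("string")
demo["title_key"] = demo["title"].map(normalize_text_key).astype("string")
print(demo[["isbn_key", "title_key"]])
```

Storing the keys with the `string` dtype keeps leading zeros intact and gives both tables the same key type at merge time.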

---

## Outputs

* **BBE_clean_enriched.csv** — enriched metadata for all BBE books
* **Goodbooks_books_clean_enriched.csv** — enriched metadata for all Goodbooks books
* **model_dataset_overlap_en_only.csv** — unified metadata + behavioral dataset filtered to English-language books
* **Enrichment and filtering logs** — documenting imputation sources, API usage, and filtering decisions

> **Note:** This notebook focuses on **metadata enrichment, English-language filtering, and dataset integration**. Model development and feature engineering will be performed in later notebooks.

# Set up

## Navigate to the Parent Directory

The cells below load and save files using paths relative to the project root (for example `data/interim/...`), so the working directory should be the root folder before any data is read or written.

First, inspect the current working directory with Python’s built-in `os` module.

In [None]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

To change to the parent directory (the root folder), run the cell below. Skip this step if you are already in the root folder: running the cell again moves up one more level.

In [None]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')
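
Because the cell above moves up one level every time it runs, a root lookup that can safely be re-run may be preferable. The sketch below is one possible approach, assuming the project root is the nearest folder that contains a `data` directory (inferred from the `data/interim/...` paths used later). The helper name `find_project_root` is hypothetical.

```python
from pathlib import Path

def find_project_root(start, marker="data"):
    """Return the nearest directory at or above `start` that contains `marker`, or None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None

root = find_project_root(Path.cwd())
if root is not None:
    print(f"Project root: {root}")
else:
    print("No 'data' directory found at or above the current directory.")
```

If a root is found, `os.chdir(root)` switches to it, and running the cell a second time leaves the working directory unchanged.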

## Load and Inspect Datasets

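In this step, we load the previously cleaned datasets: **Goodbooks-10k** (books, ratings) and **Best Books Ever**. The identifier columns `isbn_clean` and `goodreads_id_clean` are read as strings so that leading zeros in ISBNs are preserved and the merge keys share the same dtype in both tables.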

In [None]:
import pandas as pd 

# load datasets
books_clean = pd.read_csv(
    'data/interim/goodbooks/books_clean_v7.csv',
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
    )
ratings_clean = pd.read_csv('data/interim/goodbooks/ratings_clean_v1.csv')
bbe_clean = pd.read_csv(
    "data/interim/bbe/bbe_clean_v13.csv",
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
)

# create copies for imputation
books_impute = books_clean.copy()
ratings_impute = ratings_clean.copy()
bbe_impute = bbe_clean.copy()

In [None]:
# log columns, dtypes, and a small sample for each dataset
datasets = {"BBE": bbe_impute, "Books": books_impute, "Ratings": ratings_impute}

for name, df in datasets.items():
    print(f"{name} dataset columns:")
    print(df.columns.tolist())
    print(f"{name} dataset info:")
    # info() prints its own summary and returns None, so it is not wrapped in display()
    df.info()
    print(f"{name} dataset sample:")
    display(df.head(3))
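
Before merging on `goodreads_id_clean`, it can help to check how many keys are missing or repeated, since duplicate keys on the right-hand side of a left merge multiply rows on the left. The helper below is a minimal sketch. `summarize_key` is a hypothetical name and the toy frame stands in for the real tables.

```python
import pandas as pd

def summarize_key(df, key):
    """Count rows, missing values, and repeated non-missing values in a prospective merge key."""
    non_null = df[key].dropna()
    return {
        "rows": len(df),
        "missing": int(df[key].isna().sum()),
        "duplicated": int(non_null.duplicated().sum()),
    }

# Toy frame (hypothetical) with one missing and one repeated key
toy = pd.DataFrame(
    {"goodreads_id_clean": pd.array(["1", "2", "2", None], dtype="string")}
)
print(summarize_key(toy, "goodreads_id_clean"))
```

In the notebook, the same call could be applied to `books_impute` and `bbe_impute` to decide whether deduplication is needed before the merge.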

# Data Enrichment

## Enriching Goodbooks with BBE Metadata

To improve the completeness and quality of the Goodbooks-10k dataset, we selectively merge in metadata from the Best Books Ever (BBE) dataset using the shared `goodreads_id_clean` key. Goodbooks is kept as the primary source. BBE supplies fields that Goodbooks lacks (genres and page counts) and fills missing values in shared attributes (ISBN, publication date, series, and language).

This approach ensures we enhance Goodbooks only where necessary: adding new information where it is absent and completing incomplete entries without overwriting existing data. The resulting `gb_enriched` dataset combines both sources into a more reliable and feature-rich foundation for downstream analytics and modeling.
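
The fill-only-where-missing rule can be seen on a toy example. The sketch below uses made-up records and simplified column names (`id`, `language`, `genres`) that are not the notebook's real schema. It shows that an existing Goodbooks value is kept even when BBE disagrees, while a missing value is filled.

```python
import pandas as pd

# Toy Goodbooks-like table: book 1 has a language, books 2 and 3 do not
gb = pd.DataFrame({"id": ["1", "2", "3"], "language": ["eng", None, None]})

# Toy BBE-like table: covers books 1 and 2 only, and disagrees on book 1's language
bbe = pd.DataFrame({
    "id": ["1", "2"],
    "language": ["fre", "eng"],
    "genres": ["['Fantasy']", "['Classics']"],
})

merged = gb.merge(bbe.add_suffix("_bbe"), left_on="id", right_on="id_bbe", how="left")

# Shared column: keep the Goodbooks value, fall back to BBE only when it is missing
merged["language"] = merged["language"].fillna(merged["language_bbe"])

# BBE-only column: copy across as-is (stays missing for books without a BBE match)
merged["genres"] = merged["genres_bbe"]

merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_bbe")])
print(merged)
```

Book 1 keeps its original language, book 2 is filled from BBE, and book 3 stays missing in both columns because it has no BBE match.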


In [None]:
# ---------------------------------------------
# ENRICH GOODBOOKS (books_impute) WITH BBE DATA
# ---------------------------------------------

import pandas as pd

# columns to enrich ONLY when GB has NaN
columns_to_enrich = [
    "publication_date_clean",
    "series_clean",
    "isbn_clean",
    "language_clean"
    ]

# columns existent only in BBE
bbe_only_columns = [
    "pages_clean",
    "genres_clean",
    "genres_simplified"
]

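# merge Goodbooks with the needed BBE columns
merge_cols = ["goodreads_id_clean"] + columns_to_enrich + bbe_only_columns

# keep one BBE row per non-missing key so the left merge cannot duplicate Goodbooks rows
# (pandas matches missing keys to each other, so BBE rows without an ID are dropped first)
bbe_lookup = (
    bbe_impute[merge_cols]
    .dropna(subset=["goodreads_id_clean"])
    .drop_duplicates(subset="goodreads_id_clean", keep="first")
)

gb_enriched = books_impute.merge(
    bbe_lookup.add_suffix("_bbe"),
    left_on="goodreads_id_clean",
    right_on="goodreads_id_clean_bbe",
    how="left"
)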

# ---------------------------------------------
# ADD BBE-ONLY COLUMNS (pages, genres)
# ---------------------------------------------
print("\n--- ADDING BBE-ONLY COLUMNS ---")
for col in bbe_only_columns:
    gb_enriched[col] = gb_enriched[col + "_bbe"]
    filled = gb_enriched[col].notna().sum()
    print(f"{col}: {filled} rows populated from BBE")

# ---------------------------------------------
# ENRICH SHARED COLUMNS ONLY WHERE GB IS NaN
# ---------------------------------------------
print("\n--- ENRICHING SHARED COLUMNS (GB NaN → fill from BBE) ---")
for col in columns_to_enrich:
    before = gb_enriched[col].isna().sum()
    gb_enriched[col] = gb_enriched[col].fillna(gb_enriched[col + "_bbe"])
    after = gb_enriched[col].isna().sum()
    print(f"{col}: filled {before - after} missing values")

# ---------------------------------------------
# CLEANUP
# ---------------------------------------------
gb_enriched = gb_enriched.drop(columns=[c for c in gb_enriched.columns if c.endswith("_bbe")])

print("\nEnrichment complete!")
print("Final shape:", gb_enriched.shape)
gb_enriched[['isbn_clean','title_clean', 'series_clean', 'genres_clean', 'genres_simplified', 'pages_clean', 'publication_date_clean']].head()
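
To support the validation step listed in the task overview, fill rates can be reported per column after enrichment. The helper below is a minimal sketch. `fill_rates` is a hypothetical name and the toy frame stands in for `gb_enriched`.

```python
import pandas as pd

def fill_rates(df, columns):
    """Percentage of non-missing values per column, rounded to one decimal place."""
    return (df[columns].notna().mean() * 100).round(1)

# Toy frame (hypothetical) standing in for the enriched Goodbooks table
toy = pd.DataFrame({
    "genres_clean": ["['Fantasy']", None, "['Classics']", None],
    "language_clean": ["eng", "eng", None, "eng"],
})
rates = fill_rates(toy, ["genres_clean", "language_clean"])
print(rates)
```

In the notebook, this could be called as `fill_rates(gb_enriched, bbe_only_columns + columns_to_enrich)`. Comparing `len(gb_enriched)` with `len(books_impute)` is a quick additional check that the merge did not duplicate any Goodbooks rows.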